I'm working on a utility that reads through a JSON file I've been given and transforms it into SQL Server. My weapon of choice is a .NET Core console app (I'm trying to do all of my new work with .NET Core unless there is a compelling reason not to). I have the whole thing "working", but there is clearly a problem somewhere because the performance is truly horrifying, almost to the point of being unusable.
The JSON file is approximately 27 MB and contains a main array of 214 elements. Each of those contains a couple of fields along with an array of 150-350 records (that array has several fields and potentially a small <5 record array or two). Total records are approximately 35,000.
In the code below I've changed some names and stripped out a few of the fields to keep it more readable but all of the logic and code that does actual work is unchanged.
Keep in mind, I've done a lot of testing with the placement and number of calls to SaveChanges(), thinking initially that the number of trips to the Db was the problem. Although the version below calls SaveChanges() once for each iteration of the 214-record loop, I've tried moving it outside of the entire looping structure and there is no discernible change in performance. In other words, with zero trips to the Db, this is still SLOW. How slow you ask? How does > 24 hours to run hit you? I'm willing to try anything at this point and am even considering moving the whole process into SQL Server, but I would much rather work in C# than T-SQL.
static void Main(string[] args)
{
string statusMsg = String.Empty;
JArray sets = JArray.Parse(File.ReadAllText(@"C:\Users\Public\Downloads\ImportFile.json"));
try
{
using (var _db = new WidgetDb())
{
for (int s = 0; s < sets.Count; s++)
{
Console.WriteLine($"{s.ToString()}: {sets[s]["name"]}");
// First we create the Set
Set eSet = new Set()
{
SetCode = (string)sets[s]["code"],
SetName = (string)sets[s]["name"],
Type = (string)sets[s]["type"],
Block = (string)sets[s]["block"] ?? ""
};
_db.Entry(eSet).State = Microsoft.EntityFrameworkCore.EntityState.Added;
JArray widgets = sets[s]["widgets"].ToObject<JArray>();
for (int c = 0; c < widgets.Count; c++)
{
Widget eWidget = new Widget()
{
WidgetId = (string)widgets[c]["id"],
Layout = (string)widgets[c]["layout"] ?? "",
WidgetName = (string)widgets[c]["name"],
WidgetNames = "",
ReleaseDate = releaseDate, // populated from the set's release date field (handling stripped for brevity, see note above)
SetCode = (string)sets[s]["code"]
};
// WidgetColors
if (widgets[c]["colors"] != null)
{
JArray widgetColors = widgets[c]["colors"].ToObject<JArray>();
for (int cc = 0; cc < widgetColors.Count; cc++)
{
WidgetColor eWidgetColor = new WidgetColor()
{
WidgetId = eWidget.WidgetId,
Color = (string)widgets[c]["colors"][cc]
};
_db.Entry(eWidgetColor).State = Microsoft.EntityFrameworkCore.EntityState.Added;
}
}
// WidgetTypes
if (widgets[c]["types"] != null)
{
JArray widgetTypes = widgets[c]["types"].ToObject<JArray>();
for (int ct = 0; ct < widgetTypes.Count; ct++)
{
WidgetType eWidgetType = new WidgetType()
{
WidgetId = eWidget.WidgetId,
Type = (string)widgets[c]["types"][ct]
};
_db.Entry(eWidgetType).State = Microsoft.EntityFrameworkCore.EntityState.Added;
}
}
// WidgetVariations
if (widgets[c]["variations"] != null)
{
JArray widgetVariations = widgets[c]["variations"].ToObject<JArray>();
for (int cv = 0; cv < widgetVariations.Count; cv++)
{
WidgetVariation eWidgetVariation = new WidgetVariation()
{
WidgetId = eWidget.WidgetId,
Variation = (string)widgets[c]["variations"][cv]
};
_db.Entry(eWidgetVariation).State = Microsoft.EntityFrameworkCore.EntityState.Added;
}
}
}
_db.SaveChanges();
}
}
statusMsg = "Import Complete";
}
catch (Exception ex)
{
statusMsg = ex.Message + " (" + ex.InnerException + ")";
}
Console.WriteLine(statusMsg);
Console.ReadKey();
}
I had an issue with that kind of code: lots of loops and tons of state changes.
Any change/manipulation you make in the _db context generates a "trace" of it, and that tracking makes your context slower each time. Read more here.
The fix for me was to create a new EF context (_db) at some key points. It saved me a few hours per run!
You could try to create a new instance of _db on each iteration of the loop over the array that
contains a main array of 214 elements
If that makes no difference, try adding some Stopwatch timing to get a better idea of what/where is taking so long.
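A rough sketch of that suggestion, keeping the WidgetDb context and loop from the question (the batch size of 20 is an arbitrary assumption, and turning off AutoDetectChangesEnabled is an optional extra tweak that usually helps bulk inserts):
var _db = new WidgetDb();
_db.ChangeTracker.AutoDetectChangesEnabled = false;
try
{
    for (int s = 0; s < sets.Count; s++)
    {
        // ... build and add the Set, Widget, WidgetColor, etc. entities for sets[s] exactly as before ...

        if ((s + 1) % 20 == 0) // flush and start a fresh context every 20 sets
        {
            _db.SaveChanges();
            _db.Dispose();
            _db = new WidgetDb();
            _db.ChangeTracker.AutoDetectChangesEnabled = false;
        }
    }
    _db.SaveChanges(); // pick up the final partial batch
}
finally
{
    _db.Dispose();
}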
If you're making thousands of updates then EF is not really the way to go. Something like SqlBulkCopy will do the trick.
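Here is a minimal sketch of the SqlBulkCopy route, assuming a dbo.Widgets destination table whose columns match the Widget fields used in the question (connectionString and setCode are placeholders, not from the original code):
// Requires System.Data and System.Data.SqlClient.
var table = new DataTable();
table.Columns.Add("WidgetId", typeof(string));
table.Columns.Add("Layout", typeof(string));
table.Columns.Add("WidgetName", typeof(string));
table.Columns.Add("SetCode", typeof(string));

foreach (var widget in widgets) // the JArray of widgets parsed from the JSON
{
    table.Rows.Add((string)widget["id"],
                   (string)widget["layout"] ?? "",
                   (string)widget["name"],
                   setCode);
}

using (var connection = new SqlConnection(connectionString))
using (var bulk = new SqlBulkCopy(connection) { DestinationTableName = "dbo.Widgets" })
{
    connection.Open();
    bulk.WriteToServer(table);
}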
You could try the bulkwriter library.
IEnumerable<string> ReadFile(string path)
{
using (var stream = File.OpenRead(path))
using (var reader = new StreamReader(stream))
{
while (reader.Peek() >= 0)
{
yield return reader.ReadLine();
}
}
}
var items =
from line in ReadFile(@"C:\products.csv")
let values = line.Split(',')
select new Product {Sku = values[0], Name = values[1]};
then
using (var bulkWriter = new BulkWriter<Product>(connectionString)) {
bulkWriter.WriteToDatabase(items);
}
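Applied to the JSON from the question, the same pattern might look roughly like this (assuming BulkWriter can map the Widget class to a matching Widgets table; only a few of the fields are shown and connectionString is a placeholder):
IEnumerable<Widget> ReadWidgets(JArray sets)
{
    foreach (var set in sets)
    {
        foreach (var w in set["widgets"])
        {
            yield return new Widget
            {
                WidgetId = (string)w["id"],
                WidgetName = (string)w["name"],
                SetCode = (string)set["code"]
            };
        }
    }
}

using (var bulkWriter = new BulkWriter<Widget>(connectionString))
{
    bulkWriter.WriteToDatabase(ReadWidgets(sets));
}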
I'm currently working on a C# form. Basically, I have a lot of log files, and most of them have duplicate lines between them. This form is supposed to concatenate a lot of those files into one file and then delete all the duplicates in it, so that I end up with one log file without duplicates. I've already successfully made it work by taking 2 files, concatenating them, deleting all the duplicates, then repeating the process until I have no more files. Here is the function I made for this:
private static void DeleteAllDuplicatesFastWithMemoryManagement(HashSet<string>[] path_list, string parent_path, ProgressBar pBar1, BackgroundWorker backgroundWorker1)
{
for (int j = 0; j < path_list.Length; j++)
{
HashSet<string>.Enumerator em = path_list[j].GetEnumerator();
List<string> LogFile = new List<string>();
while (em.MoveNext())
{
var secondLogFile = File.ReadAllLines(em.Current);
LogFile = LogFile.Concat(secondLogFile).ToList();
LogFile = LogFile.Distinct().ToList();
backgroundWorker1.ReportProgress(1);
}
LogFile = LogFile.Distinct().ToList();
string new_path = parent_path + "/new_data/probe." + j + ".log";
File.WriteAllLines(new_path, LogFile.Distinct().ToArray());
}
}
path_list contains all the paths to the files I need to process.
path_list[0] contains all the probe.0.log files
path_list[1] contains all the probe.1.log files ...
Here is the idea I have for my problem, but I have no idea how to code it:
private static void DeleteAllDuplicatesFastWithMemoryManagement(HashSet<string>[] path_list, string parent_path, ProgressBar pBar1, BackgroundWorker backgroundWorker1)
{
for (int j = 0; j < path_list.Length; j++)
{
HashSet<string>.Enumerator em = path_list[j].GetEnumerator();
List<string> LogFile = new List<string>();
while (em.MoveNext())
{
// how I see it
if (currentMemoryUsage + newfile.Length > maximumProcessMemory) {
LogFile = LogFile.Distinct().ToList();
}
//end
var secondLogFile = File.ReadAllLines(em.Current);
LogFile = LogFile.Concat(secondLogFile).ToList();
LogFile = LogFile.Distinct().ToList();
backgroundWorker1.ReportProgress(1);
}
LogFile = LogFile.Distinct().ToList();
string new_path = parent_path + "/new_data/probe." + j + ".log";
File.WriteAllLines(new_path, LogFile.Distinct().ToArray());
}
}
I think this method will be much quicker, and it will adjust to any computer's specs. Can anyone help me make this work, or tell me if I'm wrong?
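For reference, the two values in that pseudocode could be approximated like this (the names and the 4 GB limit are hypothetical, not part of the original code):
long currentMemoryUsage = Process.GetCurrentProcess().PrivateMemorySize64; // requires System.Diagnostics
long maximumProcessMemory = 4L * 1024 * 1024 * 1024;                       // e.g. allow the process ~4 GB before de-duplicating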
You are creating far too many lists and arrays and Distincts.
Just combine everything in a HashSet, then write it out
private static void CombineNoDuplicates(HashSet<string>[] path_list, string parent_path, ProgressBar pBar1, BackgroundWorker backgroundWorker1)
{
    var logFile = new HashSet<string>(1000); // pre-size your hashset to a suitable size
    for (int j = 0; j < path_list.Length; j++)
    {
        logFile.Clear();
        foreach (var path in path_list[j])
        {
            var lines = File.ReadLines(path);
            logFile.UnionWith(lines);
            backgroundWorker1.ReportProgress(1);
        }
        string new_path = Path.Combine(parent_path, "new_data", "probe." + j + ".log");
        File.WriteAllLines(new_path, logFile);
    }
}
Ideally you should use async instead of BackgroundWorker, which is deprecated. This also means you don't need to store a whole file in memory at once, except for the first one.
private static async Task CombineNoDuplicatesAsync(HashSet<string>[] path_list, string parent_path, ProgressBar pBar1)
{
    var logFile = new HashSet<string>(1000); // pre-size your hashset to a suitable size
    for (int j = 0; j < path_list.Length; j++)
    {
        logFile.Clear();
        foreach (var path in path_list[j])
        {
            using (var sr = new StreamReader(path))
            {
                string line;
                while ((line = await sr.ReadLineAsync()) != null)
                {
                    logFile.Add(line);
                }
            }
        }
        string new_path = Path.Combine(parent_path, "new_data", "probe." + j + ".log");
        await File.WriteAllLinesAsync(new_path, logFile);
    }
}
If you want to risk a colliding hash code, you could cut down your memory usage even further by putting just the strings' hashes in a HashSet; then you can fully stream all files.
Caveat: colliding hash codes are a distinct possibility, especially with many strings. Analyze your data to see if you can take that risk.
private static async Task CombineNoDuplicatesAsync(HashSet<string>[] path_list, string parent_path, ProgressBar pBar1)
{
    var hashes = new HashSet<int>(1000); // pre-size your hashset to a suitable size
    for (int j = 0; j < path_list.Length; j++)
    {
        hashes.Clear();
        string new_path = Path.Combine(parent_path, "new_data", "probe." + j + ".log");
        using (var output = new StreamWriter(new_path))
        {
            foreach (var path in path_list[j])
            {
                using (var sr = new StreamReader(path))
                {
                    string line;
                    while ((line = await sr.ReadLineAsync()) != null)
                    {
                        if (hashes.Add(line.GetHashCode()))
                            await output.WriteLineAsync(line);
                    }
                }
            }
        }
    }
}
You can get even more performance if you read the files as Span<byte> and parse the lines yourself, but I will leave that as an exercise for the reader as it's quite complex.
Assuming your log files already contain lines that are sorted in chronological order [1], we can effectively treat them as intermediate files for a multi-file sort and perform merging/duplicate elimination in one go.
It would be a new class, something like this:
internal class LogFileMerger : IEnumerable<string>
{
private readonly List<IEnumerator<string>> _files;
public LogFileMerger(HashSet<string> fileNames)
{
    _files = fileNames.Select(fn => File.ReadLines(fn).GetEnumerator()).ToList();
    // Prime each enumerator so Current is valid before the first merge pass; drop any empty files.
    _files.RemoveAll(e => !e.MoveNext());
}
public IEnumerator<string> GetEnumerator()
{
while (_files.Count > 0)
{
var candidates = _files.Select(e => e.Current);
var nextLine = candidates.OrderBy(c => c).First();
for (int i = _files.Count - 1; i >= 0; i--)
{
while (_files[i].Current == nextLine)
{
if (!_files[i].MoveNext())
{
_files.RemoveAt(i);
break;
}
}
}
yield return nextLine;
}
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
You can create a LogFileMerger using the set of input log file names and pass it directly as the IEnumerable<string> to some method like File.WriteAllLines. Using File.ReadLines should mean that the amount of memory being used for each input file is just a small buffer on each file, and we never attempt to have all of the data from any of the files loaded at any time.
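For example (hypothetical usage, reusing the path_list and parent_path names from the question):
// Merge all probe.0.log files into one de-duplicated, still-sorted output file.
var merger = new LogFileMerger(path_list[0]);
File.WriteAllLines(Path.Combine(parent_path, "new_data", "probe.0.log"), merger);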
(You may want to adjust the OrderBy and comparison operations in the above if there are requirements around case insensitivity, but I don't see any evidence of that in the question.)
(Note also that this class cannot be enumerated multiple times in the current design. That could be adjusted by storing the paths instead of the open enumerators in the class field and making the list of open enumerators a local inside GetEnumerator)
[1] If this is not the case, it may be more sensible to sort each file first so that this assumption is met and then proceed with this plan.
I am using a basic StreamReader to loop through a CSV file of about 65 GB (450 million rows).
using (sr = new StreamReader(currentFileName))
{
string headerLine = sr.ReadLine(); // skip the headers
string currentTick;
while ((currentTick = sr.ReadLine()) != null)
{
string[] tickValue = currentTick.Split(',');
// Ticks are formatted and added to the array in order to insert them afterwards.
}
}
This creates a list that will hold the ticks that belong to a candle and then calls the insertTickBatch function.
private async static Task insertTickBatch(List<Tick> ticks)
{
if (ticks != null && ticks.Any())
{
using (DatabaseEntities db = new DatabaseEntities())
{
db.Configuration.LazyLoadingEnabled = false;
int currentCandleId = ticks.First().CandleId;
var candle = db.Candles.Where(c => c.Id == currentCandleId).FirstOrDefault();
foreach (var curTick in ticks)
{
candle.Ticks.Add(curTick);
}
await db.SaveChangesAsync();
db.Dispose();
Thread.Sleep(10);
}
}
}
This however takes about 15 years to complete and my intention is to speed this up. How do I achieve this?
I am not sure which EF version you are using, but if it's available, try this instead of your foreach loop:
db.Ticks.AddRange(ticks);
Also, CsvHelper is a nice package that can convert your entire file into a Tick object list, and of course the Thread.Sleep has to go.
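Put together, the batch method might look roughly like this (a sketch assuming each Tick already carries its CandleId, as in the question; AutoDetectChangesEnabled is an extra, optional EF6-style tweak):
private static async Task InsertTickBatchAsync(List<Tick> ticks)
{
    if (ticks == null || !ticks.Any())
        return;

    using (var db = new DatabaseEntities())
    {
        db.Configuration.LazyLoadingEnabled = false;
        db.Configuration.AutoDetectChangesEnabled = false; // assumes the EF6-style Configuration object seen in the question

        db.Ticks.AddRange(ticks);    // one call instead of adding tick-by-tick to candle.Ticks
        await db.SaveChangesAsync(); // and no Thread.Sleep afterwards
    }
}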
Backstory: I'm generating CSV files as reports, and I am testing what happens when multiple big reports are generated. The code below should generate around a 4 MB CSV file, so if I call for this report 2 times my PC throttles at my 16 GB of RAM, and even after the files have been produced the program still uses all my RAM; only by restarting the program can I free it. The RAM is mostly used by the boilerplate of the SMS class.
My issue is that the tmp list is never cleared/cleaned up by the garbage collector, even after the controller call is finished, which results in large amounts of RAM being used.
I can see that the RAM usage only increases when generating the tmp list, not when creating the CSV file.
(Screenshots referenced: the console output, the Visual Studio memory diagnostic after clicking download once, and a snapshot of memory.)
private static readonly Random random = new();
private string GenerateString(int length = 30)
{
StringBuilder str_build = new StringBuilder();
Random random = new Random();
char letter;
for (int i = 0; i < length; i++)
{
double flt = random.NextDouble();
int shift = Convert.ToInt32(Math.Floor(25 * flt));
letter = Convert.ToChar(shift + 65);
str_build.Append(letter);
}
return str_build.ToString();
}
[HttpGet("SMS")]
public async Task<ActionResult> GetSMSExport([FromQuery]string phoneNumber)
{
Console.WriteLine($"Generating items: {DateTime.Now.ToLongTimeString()}");
Console.WriteLine($"Finished generating items: {DateTime.Now.ToLongTimeString()}");
Console.WriteLine($"Generating CSV: {DateTime.Now.ToLongTimeString()}");
var tmp = new List<SMS>();
// tmp never gets cleared by the garbage collector, even if its not used after the call is finished
for (int i = 0; i < 1000000; i++)
{
tmp.Add(new SMS() {
GatewayName = GenerateString(),
Message = GenerateString(),
Status = GenerateString()
});
}
// 1048574 rows is the max; Excel says 1048576, but the header and separator line mean it has to be reduced by 2
ActionResult csv = await ExportDataAsCSV(tmp, $"SMS_Report.csv");
Console.WriteLine($"Finished generating CSV: {DateTime.Now.ToLongTimeString()}");
return csv;
}
private async Task<ActionResult> ExportDataAsCSV(IEnumerable<object> listToExport, string fileName)
{
Console.WriteLine("Creating file");
if (listToExport is null || !listToExport.Any())
throw new ArgumentNullException(nameof(listToExport));
System.IO.File.Delete("Reports/" + GenerateString() + fileName);
var file = System.IO.File.Create("Reports/" + GenerateString() + fileName, 4096, FileOptions.DeleteOnClose);
var streamWriter = new StreamWriter(file, Encoding.UTF8);
await streamWriter.WriteAsync("sep=;");
await streamWriter.WriteAsync(Environment.NewLine);
var headerNames = listToExport.First().GetType().GetProperties();
foreach (var header in headerNames)
{
var displayAttribute = header.GetCustomAttributes(typeof(System.ComponentModel.DataAnnotations.DisplayAttribute),true);
if (displayAttribute.Length != 0)
{
var attribute = displayAttribute.Single() as System.ComponentModel.DataAnnotations.DisplayAttribute;
await streamWriter.WriteAsync(sharedLocalizer[attribute.Name] + ";");
}
else
await streamWriter.WriteAsync(header.Name + ";");
}
await streamWriter.WriteAsync(Environment.NewLine);
var newListToExport = listToExport.ToArray();
for (int j = 0; j < newListToExport.Length; j++)
{
object item = newListToExport[j];
var itemProperties = item.GetType().GetProperties();
for (int i = 0; i < itemProperties.Length; i++)
{
await streamWriter.WriteAsync(itemProperties[i].GetValue(item)?.ToString() + ";");
}
await streamWriter.WriteAsync(Environment.NewLine);
}
Helpers.LogHelper.Log(Helpers.LogHelper.LogType.Info, GetType(), $"User {User.Identity.Name} downloaded {fileName}");
await file.FlushAsync();
file.Position = 0;
return File(file, "text/csv", fileName);
}
public class SMS
{
[Display(Name = "Sent_at_text")]
public DateTime? SentAtUtc { get; set; }
[Display(Name = "gateway_name")]
public string GatewayName { get; set; }
[Display(Name = "message_title")]
[JsonProperty("messageText")]
public string Message { get; set; }
[Display(Name = "status_title")]
[JsonProperty("statusText")]
public string Status { get; set; }
}
The answer can be found in this post:
MVC memory issue, memory not getting cleared after controller call is finished (Example project included)
Copied answer:
The GitHub version is a fixed version for this issue, so you can go explore what I did in the changeset.
Notes:
After generating a large file, you might need to download smaller files before C# releases the memory.
Adding forced garbage collection helped a lot.
Adding a few using statements also helped a lot (see the sketch below).
Smaller existing issues:
If your object can't fit in RAM and it starts filling the pagefile, it will not shrink the pagefile after use (restarting your PC will help a little but won't clear the pagefile entirely).
I couldn't get it below 400 MB of RAM usage no matter what I tried, but it didn't matter whether I had 5 GB or 1 GB of RAM; it would still get reduced down to ~400 MB.
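A rough illustration of those notes, using the names from ExportDataAsCSV above (a sketch of the direction, not the exact change from the linked post): give the StreamWriter a using block with leaveOpen: true so its buffer is flushed and released while the FileStream stays alive for the returned FileStreamResult, then nudge the GC after a large export.
var file = System.IO.File.Create("Reports/" + GenerateString() + fileName, 4096, FileOptions.DeleteOnClose);
using (var streamWriter = new StreamWriter(file, Encoding.UTF8, bufferSize: 4096, leaveOpen: true))
{
    // ... write the separator line, headers and rows exactly as before ...
    await streamWriter.FlushAsync();
}

GC.Collect();                   // the "forced garbage collection" mentioned above
GC.WaitForPendingFinalizers();

file.Position = 0;
return File(file, "text/csv", fileName);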
I am writing a small data migration tool to move data from one big database to another, smaller database. All of the other data migration methods worked satisfactorily, but the following method starts throwing an exception once the skip value reaches 100. I have run this console script remotely as well as directly on the source server, and I tried many different ways to find the actual problem. What I found is that from a skip value of 100 onward it fails for any take of 1, 2, 3, 4, 5 or more.
I don't have any prior knowledge of this type of problem. Any suggestion or comment that helps resolve it is appreciated. Thanks for your time.
I know this code is not clean and the method is too long. I just tried to solve this by adding some extra lines of code, because solving the problem is my main concern. I have copy-pasted the last edited version of the method.
In short, the problem can be illustrated with these two lines:
var temp = queryable.Skip(90).Take(10).ToList(); // no exception
var temp = queryable.Skip(100).Take(10).ToList(); // getting a timeout exception
private static void ImporterDataMigrateToRmgDb(SourceDBEntities sourceDb, RmgDbContext rmgDb)
{
int skip = 0;
int take = 10;
int count = sourceDb.FormAs.Where(x=> x.FormAStateId == 8).GroupBy(x=> x.ImporterName).Count();
Console.WriteLine("Total Possible Importer: " + count);
for (int i = 0; i < count/take; i++)
{
IOrderedQueryable<FormA> queryable = sourceDb.FormAs.Where(x => x.FormAStateId == 8).OrderBy(x => x.ImporterName);
List<IGrouping<string, FormA>> list;
try
{
list = queryable.Skip(skip).Take(take).GroupBy(x => x.ImporterName).ToList();
//this line is getting timeout exception from the skip value of 100.
}
catch (Exception exception)
{
Console.WriteLine(exception.Message);
sourceDb.Dispose();
rmgDb.Dispose();
sourceDb = new SourceDBEntities();
rmgDb = new RmgDbContext();
skip += take;
continue;
}
if (list.Count > 0)
{
foreach (var l in list)
{
List<FormA> formAs = l.ToList();
FormA formA = formAs.FirstOrDefault();
if (formA == null) continue;
Importer importer = formA.ConvertToRmgImporterFromFormA();
Console.WriteLine(formA.FormANo + " " + importer.Name);
var importers = rmgDb.Importers.Where(x => x.Name.ToLower() == importer.Name.ToLower()).ToList();
//bool any = rmgDb.Importers.Any(x => x.Name.ToLower() == formA.ImporterName.ToLower());
if (importers.Count() == 1)
{
foreach (var imp in importers)
{
Importer entity = rmgDb.Importers.Find(imp.Id);
entity.Country = importer.Country;
entity.TotalImportedAmountInUsd = importer.TotalImportedAmountInUsd;
rmgDb.Entry(entity).State = EntityState.Modified;
}
}
else
{
rmgDb.Importers.Add(importer);
}
rmgDb.SaveChanges();
Console.WriteLine(importer.Name);
}
}
skip += take;
}
Console.WriteLine("Importer Data Migration Completed");
}
I have fixed my problem by modifying the code as follows:
var queryable =
sourceDb.FormAs.Where(x => x.FormAStateId == 8)
.Select(x => new Adapters.ImporterBindingModel()
{
Id = Guid.NewGuid().ToString(),
Active = true,
Created = DateTime.Now,
CreatedBy = "System",
Modified = DateTime.Now,
ModifiedBy = "System",
Name = x.ImporterName,
Address = x.ImporterAddress,
City = x.City,
ZipCode = x.ZipCode,
CountryId = x.CountryId
})
.OrderBy(x => x.Name);
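With that projection in place, the paging call itself keeps the same shape (a hypothetical continuation using the skip/take variables from the method above):
var list = queryable.Skip(skip).Take(take).GroupBy(x => x.Name).ToList();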
I am fairly new to Parse and am trying to switch to Parse from a PHP/MySQL solution. I am making a multiplayer game with Unity3D in which you can play one on one. My issue concerns the game menu. One game consists of 3 objects: 1 main game object (class: Games) and 2 player objects (class: GamePlayers). These 2 objects have 2 different references/relations to other objects: one to the gameObjectId and one to the playerObjectId (class: Players). I have not used a pointer or relation for this because I am not sure how to use them.
Now, to get all the games a player is involved in, I query the GamePlayers class to find all objects containing the current player's objectId.
void getGames(string playerObjectId){
var query = ParseObject.GetQuery("GamePlayers")
.WhereEqualTo("playerObjectId", playerObjectId);
query.FindAsync().ContinueWith(t => {
if (t.IsFaulted){
} else {
IEnumerable<ParseObject> results = t.Result;
foreach (var result in results) {
getGameData(result.Get<string>("gameObjectId"), playerObjectId, result.Get<string>("playerTiles"), result.Get<int>("playerTurn"));
}
}
});
}
From this I call another method (getGameData) in which I would like to get the game data and opponent game data for each game.
void getGameData(string gameObjectId, string playerObjectId, string playertiles, int playerTurn){
string bagtiles = "";
string tabletiles = "";
string newtiles = "";
int lastdraw = 0;
int sqlid = 0;
string oppObjectId = "";
string oppDataReturn = "";
var query = ParseObject.GetQuery("Games")
.WhereEqualTo("objectId", gameObjectId);
query.FindAsync().ContinueWith(t => {
if (t.IsFaulted){
} else {
IEnumerable<ParseObject> results = t.Result;
foreach (var result in results) {
bagtiles = result.Get<string>("bagtiles");
tabletiles = result.Get<string>("tabletiles");
newtiles = result.Get<string>("newtiles");
sqlid = result.Get<int>("sqlId");
}
}
});
query = ParseObject.GetQuery("GamePlayers")
.WhereEqualTo("gameObjectId", gameObjectId)
.WhereNotEqualTo("playerObjectId", playerObjectId);
query.FindAsync().ContinueWith(t => {
if (t.IsFaulted){
} else {
IEnumerable<ParseObject> results = t.Result;
foreach (var result in results) {
oppObjectId = result.Get<string>("playerObjectId");
lastdraw = result.Get<int>("lastDraw");
}
}
});
}
Now, all I am missing is the opponent's player data, e.g. the username etc.
var query = ParseObject.GetQuery("Players")
.WhereEqualTo("objectId", oppObjectId);
query.FindAsync().ContinueWith(t => {
if (t.IsFaulted){
} else {
IEnumerable<ParseObject> results = t.Result;
foreach (var result in results) {
oppName = result.Get<string>("username");
}
}
});
I thought about running yet another method, but it seems like this could be done more efficiently, and maybe in one query?
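A sketch of one possibility, using only the classes and fields from the code above: chaining the queries with await rather than separate ContinueWith callbacks makes each result available before the next query runs (this assumes your Unity/.NET profile supports async/await, and it is still three round trips rather than one query):
async Task GetGameDataAsync(string gameObjectId, string playerObjectId)
{
    var games = await ParseObject.GetQuery("Games")
        .WhereEqualTo("objectId", gameObjectId)
        .FindAsync();
    ParseObject game = games.FirstOrDefault();

    var opponents = await ParseObject.GetQuery("GamePlayers")
        .WhereEqualTo("gameObjectId", gameObjectId)
        .WhereNotEqualTo("playerObjectId", playerObjectId)
        .FindAsync();
    ParseObject opponent = opponents.FirstOrDefault();

    if (game != null && opponent != null)
    {
        var players = await ParseObject.GetQuery("Players")
            .WhereEqualTo("objectId", opponent.Get<string>("playerObjectId"))
            .FindAsync();
        ParseObject oppPlayer = players.FirstOrDefault();
        string oppName = oppPlayer != null ? oppPlayer.Get<string>("username") : "";
        // game, opponent and oppName are all populated here, before the method returns
    }
}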
At the end I add each game to my gamelist like so:
if(playerTurn == 0){
myturn.Add(gameObjectId+"^"+sqlid+"^"+bagtiles+"^"+tabletiles+"^"+newtiles+"^"+playertiles+"^"+lastdraw+"^"+oppObjectId);
} else if(playerTurn == 1){
theirturn.Add(gameObjectId+"^"+sqlid+"^"+bagtiles+"^"+tabletiles+"^"+newtiles+"^"+playertiles+"^"+lastdraw+"^"+oppObjectId);
} else if(playerTurn > 1){
finnished.Add(gameObjectId+"^"+sqlid+"^"+bagtiles+"^"+tabletiles+"^"+newtiles+"^"+playertiles+"^"+lastdraw+"^"+oppObjectId);
}
Hope this makes sense and somebody can guide me in the right direction. I am not an expert in C# but am learning ;-)
Thanks in advance :-)