I have a Stock class which loads lots of stock data history from a file (about 100 MB). I have a Pair class that takes two Stock objects and calculates some statistical relations between the two then writes the results to file.
In my main method I have a loop going through a list of pairs of stocks (about 500). It creates 2 stock objects and then a pair object out of the two. At this point the pair calculations are written to file and I'm done with the objects. I need to free the memory so I can go on with the next calculation.
In addition to setting the 3 objects to null, I have added the following two lines at the end of the loop:
GC.Collect(GC.MaxGeneration);
GC.WaitForPendingFinalizers();
Stepping over these two lines seems to free only about 50 MB of the 200-300 MB that is allocated on every loop iteration (viewing it in Task Manager).
The program gets through about eight or ten pairs before it gives me an OutOfMemoryException. The memory usage steadily increases until it crashes at about 1.5 GB. (This is an 8 GB machine running Win7 Ultimate.)
I don't have much experience with garbage collection. Am I doing something wrong?
Here's my code since you asked. (Note: the program has two modes: 1) add mode, in which new pairs are added to the system, and 2) regular mode, which updates the pair files in real time based on FileSystemWatcher events. The stock data is updated by an external app called QCollector.)
This is the segment in MainForm which runs in Add Mode:
foreach (string line in PairList)
{
string[] tokens = line.Split(',');
stockA = new Stock(QCollectorPath, tokens[0].ToUpper());
stockB = new Stock(QCollectorPath, tokens[1].ToUpper());
double ratio = double.Parse(tokens[2]);
Pair p = new Pair(QCollectorPath, stockA, stockB, ratio);
// at this point the pair is written to file (constructor handles this)
// commenting out the following lines of code since they don't fix the problem
// stockA = null;
// stockB = null;
// p = null;
// refraining from forced collection since that's not the problem
// GC.Collect(GC.MaxGeneration);
// GC.WaitForPendingFinalizers();
// so far the only way I can fix the problem is by setting the Pair class's
// references to StockA and StockB to null
p.Kill();
}
I am adding more code as per request. Stock and Pair are subclasses of TimeSeries, which holds the common functionality:
public abstract class TimeSeries {
protected List<string> data;
// the following Create method must be implemented by subclasses (Stock, Pair, etc...)
// as each class is created differently, although their data formatting is identical
protected abstract List<string> Create();
// . . .
public void LoadFromFile()
{
data = new List<string>();
List<StreamReader> srs = GetAllFiles();
foreach (StreamReader sr in srs)
{
List<string> temp = new List<string>();
temp = TurnFileIntoListString(sr);
data = new List<string>(temp.Concat(data));
sr.Close();
}
}
// uses directory naming scheme (according to data month/year) to find files of a symbol
protected List<StreamReader> GetAllFiles()...
public static List<string> TurnFileIntoListString(StreamReader sr)
{
List<string> list = new List<string>();
string line;
while ((line = sr.ReadLine()) != null)
list.Add(line);
return list;
}
// this is the only means to access a TimeSeries object's data
// this is to prevent deadlocks caused by time-consuming methods such as Pair's Create
public string[] GetListCopy()
{
lock (data)
{
string[] listCopy = new string[data.Count];
data.CopyTo(listCopy);
return listCopy;
}
}
}
public class Stock : TimeSeries
{
public Stock(string dataFilePath, string symbol, FileSystemWatcher fsw = null)
{
DataFilePath = dataFilePath;
Name = symbol.ToUpper();
LoadFromFile();
// to update stock data when external app updates the files
if (fsw != null) fsw.Changed += new FileSystemEventHandler(fsw_Changed);
}
protected override List<string> Create()
{
// stock files are created by an external application, so there is nothing to build here
return null;
}
// . . .
}
public class Pair : TimeSeries {
public Pair(string dataFilePath, Stock stockA, Stock stockB, double ratio)
{
// assign parameters to local members
// ...
if (FileExists())
LoadFromFile();
else
Create();
}
protected override List<string> Create()
{
// since stock can get updated by fileSystemWatcher's event handler
// a copy is obtained from the stock object's data
string[] listA = StockA.GetListCopy();
string[] listB = StockB.GetListCopy();
List<string> listP = new List<string>();
int i, j;
i = GetFirstValidBar(listA);
j = GetFirstValidBar(listB);
DateTime dtA, dtB;
dtA = GetDateTime(listA[i]);
dtB = GetDateTime(listB[j]);
// this hidden segment adjusts i and j until they are starting at same datetime
// since stocks can have different amount of data
while (i < listA.Length && j < listB.Length)
{
double priceA = GetPrice(listA[i]);
double priceB = GetPrice(listB[j]);
double priceP = priceA * ratio - priceB;
listP.Add(String.Format("{0},{1:0.00},{2:0.00},{3:0.00}"
, dtA
, priceP
, priceA
, priceB
));
if (i < j)
i++;
else if (j < i)
j++;
else
{
i++;
j++;
}
}
return listP;
}
public void Kill()
{
data = null;
stockA = null;
stockB = null;
}
}
Your memory leak is here:
if (fsw != null) fsw.Changed += new FileSystemEventHandler(fsw_Changed);
The Stock instance will be kept in memory as long as the FileSystemWatcher is alive, since it is subscribed to an event of the FileSystemWatcher.
I think that you want to either implement that event somewhere else, or at some other point in your code add a:
if (fsw != null) fsw.Changed -= fsw_Changed;
Given the way the code is written, it may be that the Stock object is intended to be used without a FileSystemWatcher when bulk processing is done.
In the original code that you posted, the Stock constructors were being called with a FileSystemWatcher. You have changed that now. I think you will find that, with a null FileSystemWatcher, you can remove your Kill() call and you will not have a leak, since you are no longer listening to fsw.Changed.
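If you do keep the watcher, here is a minimal sketch of how the Stock class could detach itself. The watcher field, the Unsubscribe method and the handler body below are my own names, not from your code, and the rest of the class is omitted:
public class Stock : TimeSeries
{
    // Hypothetical field: remember the watcher we subscribed to so we can unsubscribe later.
    private FileSystemWatcher watcher;

    public Stock(string dataFilePath, string symbol, FileSystemWatcher fsw = null)
    {
        // ... existing initialization and LoadFromFile() call ...
        watcher = fsw;
        if (watcher != null) watcher.Changed += new FileSystemEventHandler(fsw_Changed);
    }

    // Detach the handler so the watcher no longer keeps this instance alive.
    public void Unsubscribe()
    {
        if (watcher != null)
        {
            watcher.Changed -= fsw_Changed;
            watcher = null;
        }
    }

    private void fsw_Changed(object sender, FileSystemEventArgs e)
    {
        // reload the stock data when the external app updates the files
        LoadFromFile();
    }
}
The Add Mode loop could then call stockA.Unsubscribe() and stockB.Unsubscribe() alongside p.Kill().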
I have a middleware telemetry handler that has a method which awaits the execution of a request and then tries to store some key data values from the response body into custom dimensions fields in Application Insights, so I can use Grafana and potentially other 3rd-party products to analyse my responses.
public class ResponseBodyHandler : IResponseBodyHandler
{
private readonly ITelemetryPropertyHandler _telemetryPropertyHandler = new TelemetryPropertyHandler();
public void TransformResponseBodyDataToTelemetryData(RequestTelemetry requestTelemetry, string responseBody)
{
SuccessResponse<List<Booking>> result = null;
try
{
result = JsonConvert.DeserializeObject<SuccessResponse<List<Booking>>>(responseBody);
}
catch (Exception e)
{
Log.Error("Telemetry response handler, failure to deserialize response body: " + e.Message);
return;
}
_telemetryPropertyHandler.CreateTelemetryProperties(requestTelemetry, result);
}
}
public class TelemetryPropertyHandler : ITelemetryPropertyHandler
{
private readonly ILabelHandler _labelHandler = new LabelHandler();
public void CreateTelemetryProperties(RequestTelemetry requestTelemetry, SuccessResponse<List<Booking>> result)
{
Header bookingHeader = result?.SuccessObject?.FirstOrDefault()?.BookingHeader;
requestTelemetry?.Properties.Add("ResponseClientId", "" + bookingHeader?.ConsigneeNumber);
Line line = bookingHeader?.Lines.FirstOrDefault();
requestTelemetry?.Properties.Add("ResponseProductId", "" + line?.PurchaseProductID);
requestTelemetry?.Properties.Add("ResponseCarrierId", "" + line?.SubCarrierID);
_labelHandler.HandleLabel(requestTelemetry, bookingHeader);
requestTelemetry?.Properties.Add("ResponseBody", JsonConvert.SerializeObject(result));
}
}
Now, inside: _labelHandler.HandleLabel(requestTelemetry, bookingHeader);
It extracts an image that is base64 encoded, chunks the string into pieces of 8192 characters, and adds them to the Properties as "Image index 0" .. "Image index N" (N being the total number of chunks).
I can debug and verify that the code works.
However, in Application Insights the entire "request" entry is missing, not just the custom dimensions.
I am assuming that this is due to a maximum size constraint, and that I am likely trying to add more data than is "allowed"; however, I can't for the life of me find the documentation that enforces this restriction.
Can someone tell me what rule I am breaking, so I can either truncate the image out (if it isn't possible to store that much data) or fix whatever else I am doing wrong?
I have validated that my code works fine as long as I truncate the data into a single property, which of course only partially stores the image (making said "feature" useless).
public class LabelHandler : ILabelHandler
{
private readonly IBase64Splitter _base64Splitter = new Base64Splitter();
public void HandleLabel(RequestTelemetry requestTelemetry, Header bookingHeader)
{
Label label = bookingHeader?.Labels.FirstOrDefault();
IEnumerable<List<char>> splitBase64String = _base64Splitter.SplitList(label?.Base64.ToList());
if (splitBase64String != null)
{
bool imageHandlingWorked = true;
try
{
int index = 0;
foreach (List<char> chunkOfImageString in splitBase64String)
{
string dictionaryKey = $"Image index {index}";
string chunkData = new string(chunkOfImageString.ToArray());
requestTelemetry?.Properties.Add(dictionaryKey, chunkData);
index++;
}
}
catch (Exception e)
{
imageHandlingWorked = false;
Log.Error("Error trying to store label in chunks: " + e.Message);
}
if (imageHandlingWorked && label != null)
{
label.Base64 = "";
}
}
}
}
The above code is responsible for adding the chunks to a requestTelemetry Property field
public class Base64Splitter : IBase64Splitter
{
private const int ChunkSize = 8192;
public IEnumerable<List<T>> SplitList<T>(List<T> originalList)
{
for (var i = 0; i < originalList.Count; i += ChunkSize)
yield return originalList.GetRange(i, Math.Min(ChunkSize, originalList.Count - i));
}
}
This is the method that creates the chunks of characters corresponding to the Application Insights maximum size per custom dimension field.
Here is an image of the truncated field being added, if I just limit myself to a single property, but truncate the base64 encoded value.
[I'm from Application Insights team]
You can find field limits documented here: https://learn.microsoft.com/en-us/azure/azure-monitor/app/data-model-request-telemetry
On the ingestion side there is a limit of 64 * 1024 bytes for the overall JSON payload (we need to add this to the documentation).
You're facing something different though: the custom dimensions are removed completely. Maybe the SDK detects that 64 KB is exceeded and "mitigates" it this way. Can you try to limit it to a little bit less than 64 KB?
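One way to try that: stop adding image chunks once the accumulated property data approaches the limit. A rough sketch of how the loop in LabelHandler could be guarded; the 60000-byte budget is an arbitrary safety margin rather than an official value, and it assumes using System.Linq and System.Text are in scope:
// Hypothetical guard inside HandleLabel: keep the total key/value payload
// under an assumed budget so the whole request entry is not dropped.
const int payloadBudget = 60000; // assumption: a bit below the 64 * 1024 byte limit
int usedBytes = requestTelemetry.Properties
    .Sum(kv => Encoding.UTF8.GetByteCount(kv.Key) + Encoding.UTF8.GetByteCount(kv.Value));

int index = 0;
foreach (List<char> chunkOfImageString in splitBase64String)
{
    string chunkData = new string(chunkOfImageString.ToArray());
    int chunkBytes = Encoding.UTF8.GetByteCount(chunkData);
    if (usedBytes + chunkBytes > payloadBudget)
        break; // drop the remaining chunks rather than lose the whole entry

    requestTelemetry.Properties.Add($"Image index {index}", chunkData);
    usedBytes += chunkBytes;
    index++;
}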
I need to find a way to read information out of a very big CSV file in Unity. The file has approx. 15000*4000 entries, is almost 200 MB, and could grow even larger.
Just using ReadAllLines on the file kind of works, but as soon as I try to do any operation on it, it crashes. Here is the code I am using, which just counts all non-zero values and already crashes it. It's okay if the code needs some loading time, but it shouldn't crash. I assume it's because I keep everything in memory and therefore flood my RAM? Any ideas how to fix this so it won't crash?
private void readCSV()
{
string[] lines = File.ReadAllLines("Assets/Datasets/testCsv.csv");
foreach (string line in lines)
{
List<string> values = new List<string>();
values = line.Split(',').ToList();
int i = 0;
foreach (string val in values)
{
if (val != "0")
{
i++;
}
}
}
}
As I already stated in your other question, you should rather go with a streamed solution in order not to load the entire thing into memory at all.
Also, both file IO and string.Split are slow, especially for so many entries! Rather use a background thread / async Task for this!
The next possible issue: in your case 15000*4000 entries means a total of 60,000,000 cells, which is still fine. However, the maximum value of int is 2,147,483,647, so if your file grows further it might break / behave unexpectedly => rather use e.g. uint or directly ulong to avoid that issue.
private async Task<ulong> CountNonZeroEntries()
{
ulong count = 0;
// Using a stream reader you can load the content into memory one line at a time
using(var sr = new StreamReader("Assets/Datasets/testCsv.csv"))
{
while(true)
{
var line = await sr.ReadLineAsync();
if(line == null) break;
var values = line.Split(',');
foreach(var v in values)
{
if(v != "0") count++;
}
}
}
return count;
}
And then of course you would need to await the result, e.g. using
// If you declare Start as async, Unity automatically calls it asynchronously
private async void Start()
{
var count = await CountNonZeroEntries();
Debug.Log($"{count} cells are != \"0\".");
}
The same can be done using Linq, which is a bit easier to write in my eyes:
using System.Linq;
...
private Task<ulong> CountNonZeroEntries()
{
    // File.ReadLines streams the file lazily, so the whole content is never held in memory
    return Task.Run(() => (ulong)File.ReadLines("Assets/Datasets/testCsv.csv")
        .SelectMany(line => line.Split(','))
        .LongCount(v => v != "0"));
}
Also, File.ReadLines doesn't load the entire content at once but rather returns a lazy enumerable, so you can run Linq queries over the lines one by one.
Create a Visual C# application that displays the contents of the Teams.txt file in a ListBox control. When the user selects a team in the ListBox, the application should display the number of times that team has won the World Series in the time period from 1903 to 2012.
The two files used are Teams.txt, which contains a list of the names of teams that have won the championship at least once, and WorldSeriesWinners.txt, which contains a chronological list of the World Series winning teams from 1903 to 2012. The first line in the file is the name of the team that won in 1903 and the last line is the name of the team that won in 2012. Note that the World Series was not played in 1904 or 1994.
This is the question that I'm having problems with. In this question I must make use of a class, but the code is not working.
This is my code. I hope that you can help me find the problem.
This is the class part
class WorldSeries
{
// Field
private string _wins; // The team's total number of wins.
// Constructor
public WorldSeries()
{
_wins = "";
}
// Wins property
public string Wins
{
get { return _wins; }
set { _wins = value; }
}
}
This is the rest of my code
// Variables
string teamName; // To hold the teams names.
private void ReadTeams()
{
try
{
// Local Variables
StreamReader inputTeams; //To read the file
// Open the Teams.txt file.
inputTeams = File.OpenText("Teams.txt");
// Read the file's contents.
while (!inputTeams.EndOfStream)
{
// Read a line and add it to the ListBox.
teamName = inputTeams.ReadLine();
lst_teams.Items.Add(teamName);
}
// Close the file.
inputTeams.Close();
}
catch (Exception ex)
{
// Display an error message.
MessageBox.Show(ex.Message);
}
}
private void GetTeamWin (WorldSeries worldSeries)
{
try
{
//Local Variables
int index=0; // Loop counter, initialized to 0.
int winCount = 0; // Accumulator, initialized to 0.
// Open the WorldSeriesWinners.txt file.
StreamReader inputFile = File.OpenText("WorldSeriesWinners.txt");
// Create a List object to hold strings.
List<string> winnerList = new List<string>();
// Read the file's contents
while (!inputFile.EndOfStream)
{
// Read a line and add it to the list.
winnerList.Add(inputFile.ReadLine());
}
// Sort the items in the List.
winnerList.Sort();
while (index >=0)
{
// Search the team name in the List
index = winnerList.BinarySearch(teamName);
winCount++;
// Remove the team name from the List
winnerList.RemoveAt(index);
}
// Store the total number of wins of the team in the Wins
// parameter.
worldSeries.Wins = winCount.ToString();
// Clear the List
winnerList.Clear();
// Display the number of times the team has won.
lbl_results.Text = "The" + lst_teams.SelectedItem.ToString()
+ "has won the World Series" +
winCount + "time(s), between 1903 and 2012.";
}
catch (Exception ex)
{
// Display an error message.
MessageBox.Show(ex.Message);
}
}
private void btn_Exit_Click(object sender, EventArgs e)
{
// Close the form.
this.Close();
}
}
The number of team wins is easily small enough to hold in memory. You can read the whole file once and store a dictionary mapping the team name to the number of wins in memory. Something like this:
Dictionary<string, int> numberOfWins =
File.ReadAllLines("WorldSeriesWinners.txt")
.GroupBy(t => t)
.ToDictionary(g => g.Key, g => g.Count() );
You could then have a simple function that checks whether the selected team is in this dictionary and returns the number of wins if so, or zero if not:
private int GetNoOfWins(string teamName)
{
if (numberOfWins.ContainsKey(teamName))
{
return numberOfWins[teamName];
}
else
{
return 0;
}
}
which could easily be used in your existing code:
int winCount = GetNoOfWins(lst_teams.SelectedItem.ToString());
lbl_results.Text = "The" + lst_teams.SelectedItem.ToString()
+ "has won the World Series" +
winCount + "time(s), between 1903 and 2012.";
I have an SSIS package with a transformation script component. It loads about 460 rows, then it stops and starts the script component again (I don't know why it does this); of course it creates my C# class variables again and "forgets" where it was the first time it ran, popping out nulls for the variables.
Is there any way to make the script component not run itself again after 460 rows? The batch I am pulling is 10,000, so it can't be that.
And the weirdest thing of all is that after running the package 3 times (without changing anything), it does everything right...
public class ScriptMain : UserComponent
{
string MarkToRem;
string TypeToRem;
string SerToRem;
int IDCnt;
public override void PreExecute()
{
base.PreExecute();
}
public override void PostExecute()
{
base.PostExecute();
}
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
MyOutputBuffer.AddRow();
if(Row.IncomingPrice == "Mark")
{
MarkToRem = Row.IncomingCode; // Set the variable to remember the mark we are in
MyOutputBuffer.ID = Row.IncomingID.ToString();
MyOutputBuffer.Mark = MarkToRem;
MyOutputBuffer.Type = "";
MyOutputBuffer.Series = "";
MyOutputBuffer.Code = "";
MyOutputBuffer.Price = "";
MyOutputBuffer.Description = "Mark Verander";
}
else if( Row.IncomingPrice == "Sub")
{
TypeToRem = Row.IncomingCode; // Save our current Type
SerToRem = Row.IncomingCode; //Save our current Series
// ============ Output ========================
MyOutputBuffer.ID = Row.IncomingID.ToString();
MyOutputBuffer.Mark = MarkToRem;
MyOutputBuffer.Type = "";
MyOutputBuffer.Series = "";
MyOutputBuffer.Code = "";
MyOutputBuffer.Price = "";
MyOutputBuffer.Description = "Sub en series verander";
}
else if (Row.IncomingPrice == "Series")
{
SerToRem = Row.IncomingCode; //Save our current Series
// ============ Output ========================
MyOutputBuffer.ID = Row.IncomingID.ToString();
MyOutputBuffer.Mark = MarkToRem;
MyOutputBuffer.Type = "";
MyOutputBuffer.Series = SerToRem;
MyOutputBuffer.Code = "";
MyOutputBuffer.Price = "";
MyOutputBuffer.Description = "Series verander";
}
else
{
MyOutputBuffer.ID = Row.IncomingID.ToString();
MyOutputBuffer.Mark = MarkToRem;
MyOutputBuffer.Type = TypeToRem;
MyOutputBuffer.Series = SerToRem;
MyOutputBuffer.Code = Row.IncomingCode;
MyOutputBuffer.Price = Row.IncomingPrice;
MyOutputBuffer.Description = Row.IncomingDiscription;
}
IDCnt = IDCnt + 1;
}
}
The first 9 rows of the incoming data look like this:
ID Code Price Discription
1 184pin DDR Mark
2 DDR - Non-ECC Sub
3 ME-A1GDV4 388 Adata AD1U400A1G3-R 1Gb ddr-400 ( pc3200 ) , CL3 - 184pin - lifetime warranty
4 ME-C512DV4 199 Corsair Valueselect VS512MB400 512mb ddr-400 ( pc3200 ) , CL2.5 - 184pin - lifetime warranty
5 ME-C1GDV4 399 Corsair Valueselect VS1GB400C3 1Gb ddr-400 ( pc3200 ) , CL3 - 184pin - lifetime warranty
6 240pin DDR2 Mark
7 DDR2 - Non-ECC Sub
8 Adata - lifetime warranty Series
9 ME-A2VD26C5 345 Adata AD2U667B2G5 Valuselect , 2Gb ddr2-667 ( pc2-5400 ) , CL5 , 1.8v - 240pin - lifetime warranty
Solved it.
Avoid Asynchronous Transformation wherever possible
SSIS runtime executes every task other than data flow task in the defined sequence. Whenever the SSIS runtime engine encounters a data flow task, it hands over the execution of the data flow task to data flow pipeline engine.
The data flow pipeline engine breaks the execution of a data flow task into one or more execution trees and may execute two or more execution trees in parallel to achieve high performance.
Synchronous transformations get a record, process it and pass it to the other transformation or destination in the sequence. The processing of a record is not dependent on the other incoming rows.
An asynchronous transformation, on the other hand, requires additional buffers for its output and does not reuse the incoming input buffers. It also waits for all incoming rows to arrive before processing; that's the reason an asynchronous transformation performs slower and should be avoided wherever possible. For example, instead of using the Sort transformation you can get sorted results from the source itself by using an ORDER BY clause.
I'm building a console application that has to process a bunch of documents.
To stay simple, the process is :
for each year between X and Y, query the DB to get a list of document references to process
for each of these references, process a local file
The process method is, I think, independent and should be parallelizable as long as the input args are different:
private static bool ProcessDocument(
DocumentsDataset.DocumentsRow d,
string langCode
)
{
try
{
var htmFileName = d.UniqueDocRef.Trim() + langCode + ".htm";
var htmFullPath = Path.Combine(@"x:\path", htmFileName);
var missingHtmlFile = !File.Exists(htmFullPath);
if (!missingHtmlFile)
{
var html = File.ReadAllText(htmFullPath);
// ProcessHtml is quite long : it use a regex search for a list of reference
// which are other documents, then sends the result to a custom WS
ProcessHtml(ref html);
File.WriteAllText(htmFullPath, html);
}
return true;
}
catch (Exception exc)
{
Trace.TraceError("{0,8}Fail processing {1} : {2}","[FATAL]", d.UniqueDocRef, exc.ToString());
return false;
}
}
In order to enumerate my document, I have this method :
private static IEnumerable<DocumentsDataset.DocumentsRow> EnumerateDocuments()
{
return Enumerable.Range(1990, 2020 - 1990).AsParallel().SelectMany(year => {
return Document.FindAll((short)year).Documents;
});
}
Document is a business class that wraps the retrieval of documents. The output of this method is a typed dataset (I'm returning the Documents table). The method takes a year, and I'm sure a document can't be returned by more than one year (year is actually part of the key).
Note the use of AsParallel() here, but I never had an issue with this one.
Now, my main method is :
var documents = EnumerateDocuments();
var result = documents.Select(d => {
bool success = true;
foreach (var langCode in new string[] { "-e","-f" })
{
success &= ProcessDocument(d, langCode);
}
return new {
d.UniqueDocRef,
success
};
});
using (var sw = File.CreateText("summary.csv"))
{
sw.WriteLine("Level;UniqueDocRef");
foreach (var item in result)
{
string level;
if (!item.success) level = "[ERROR]";
else level = "[OK]";
sw.WriteLine(
"{0};{1}",
level,
item.UniqueDocRef
);
//sw.WriteLine(item);
}
}
This method works as expected in this form. However, if I replace
var documents = EnumerateDocuments();
by
var documents = EnumerateDocuments().AsParallel();
It stops working, and I don't understand why.
The error appears exactly here (in my process method):
File.WriteAllText(htmFullPath, html);
It tells me that the file is already open in another program.
I don't understand what could cause my program not to work as expected. Since my documents variable is an IEnumerable returning unique values, why is my process method breaking?
Thanks for any advice.
[Edit] Code for retrieving document :
/// <summary>
/// Get all documents in data store
/// </summary>
public static DocumentsDS FindAll(short? year)
{
Database db = DatabaseFactory.CreateDatabase(connStringName); // MS Entlib
DbCommand cm = db.GetStoredProcCommand("Document_Select");
if (year.HasValue) db.AddInParameter(cm, "Year", DbType.Int16, year.Value);
string[] tableNames = { "Documents", "Years" };
DocumentsDS ds = new DocumentsDS();
db.LoadDataSet(cm, ds, tableNames);
return ds;
}
[Edit2] Possible source of my issue, thanks to mquander. If I write:
var test = EnumerateDocuments().AsParallel().Select(d => d.UniqueDocRef);
var testGr = test.GroupBy(d => d).Select(d => new { d.Key, Count = d.Count() }).Where(c=>c.Count>1);
var testLst = testGr.ToList();
Console.WriteLine(testLst.Where(x => x.Count == 1).Count());
Console.WriteLine(testLst.Where(x => x.Count > 1).Count());
I get this result:
0
1758
Removing the AsParallel returns the same output.
Conclusion: something is wrong in my EnumerateDocuments and it returns each document twice.
I have to dig into this, I think.
The cause is probably in my source enumeration.
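Until I find why the enumeration duplicates rows, one possible workaround (just a sketch, assuming any copy of a duplicated row is fine to process) would be to de-duplicate on UniqueDocRef before processing:
// Keep only one row per UniqueDocRef before handing the sequence to the processing loop.
var documents = EnumerateDocuments()
    .GroupBy(d => d.UniqueDocRef)
    .Select(g => g.First());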
I suggest you have each task put the file data into a global queue and have a separate writer thread take writing requests from the queue and do the actual writing.
Anyway, the performance of writing in parallel on a single disk is much worse than writing sequentially, because the disk needs to spin to seek the next writing location, so you are just bouncing the disk around between seeks. It's better to do the writes sequentially.
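A rough sketch of that idea using a BlockingCollection and a single writer task (the queue name and the tuple payload are illustrative only; it assumes System.Collections.Concurrent, System.IO and System.Threading.Tasks are available):
// Workers enqueue (path, content) pairs; one background task does all the disk writes.
var writeQueue = new BlockingCollection<Tuple<string, string>>();

var writer = Task.Run(() =>
{
    foreach (var item in writeQueue.GetConsumingEnumerable())
        File.WriteAllText(item.Item1, item.Item2);
});

// Inside ProcessDocument, instead of calling File.WriteAllText directly:
writeQueue.Add(Tuple.Create(htmFullPath, html));

// After all documents have been processed:
writeQueue.CompleteAdding();
writer.Wait();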
Is Document.FindAll((short)year).Documents threadsafe? Because the difference between the first and the second version is that in the second (broken) version, this call is running multiple times concurrently. That could plausibly be the cause of the issue.
Sounds like you're trying to write to the same file. Only one thread/program can write to a file at a given time, so you can't use Parallel.
If you're reading from the same file, then you need to open the file with read-only access so as not to put a write lock on it.
The simplest way to fix the issue is to place a lock around your File.WriteAllText, assuming the writing is fast and it's worth parallelizing the rest of the code.
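For example (just a sketch; the lock object has to be shared by every thread that writes):
// Shared gate so only one thread writes at a time.
private static readonly object writeLock = new object();

// ...then, inside ProcessDocument, around the write:
lock (writeLock)
{
    File.WriteAllText(htmFullPath, html);
}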