So I'm doing Google Code Jam, and for their new format I have to upload my code as a single text file.
I like writing my code as properly constructed classes in multiple files even under time pressure (I find that I gain more time from clarity and faster debugging than I lose to the extra structure), and I want to reuse the common code.
Once I've got my code finished I have to convert from a series of classes in multiple files, to a single file.
Currently I'm just manually copying and pasting all the files' text into a single file, and then manually massaging the usings and namespaces to make it all work.
Is there a better option?
Ideally a tool that will JustDoIt for me?
Alternatively, if there were some predictable algorithm that I could implement that wouldn't require any manual tweaks?
Write your classes so that all "using"s are inside "namespace"
Write a script which collects all *.cs files and concatenates them
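A minimal sketch of such a script, assuming every file already keeps its using directives inside its namespace block so the files can simply be placed one after another. The class name SourceConcatenator and its parameters are made up for illustration:

```csharp
using System;
using System.IO;
using System.Linq;

public static class SourceConcatenator
{
    // Concatenates every *.cs file under sourceDir into a single file at outputPath.
    // Assumption: using directives live inside each namespace block, so no
    // rewriting is needed and plain concatenation yields a valid compilation unit.
    public static void Concatenate(string sourceDir, string outputPath)
    {
        string fullOutput = Path.GetFullPath(outputPath);

        var files = Directory
            .EnumerateFiles(sourceDir, "*.cs", SearchOption.AllDirectories)
            // Skip a previously generated output file if it lives in the same tree.
            .Where(f => !string.Equals(Path.GetFullPath(f), fullOutput,
                                       StringComparison.OrdinalIgnoreCase))
            .OrderBy(f => f, StringComparer.Ordinal)
            .ToList(); // materialize the list before the output file is written

        string merged = string.Join(Environment.NewLine,
                                    files.Select(f => File.ReadAllText(f)));
        File.WriteAllText(outputPath, merged);
    }
}
```

If the project's bin and obj folders sit under the same directory, you would probably want to filter them out as well, since they can contain generated .cs files.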
This is probably not the optimal way to do it, but here is an algorithm that does what you need:
loop through every file and grab every line starting with "using" -> write them to a temp file/buffer
check for duplicates and remove them
get the position of the first '{' after the charsequence "namespace"
get the position of the last '}' in the file
append the text in between these two positions onto a temp file/buffer
append the second file/buffer to the first one
write out the merged buffer
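The steps above could be sketched roughly as follows. This is only an illustration under some assumptions that are not part of the answer: each file has exactly one block-scoped namespace, using directives are only collected from the header above the namespace keyword (so that using statements inside method bodies are not picked up), and usings that point at one of the merged namespaces are dropped, because those namespace wrappers no longer exist in the output. SourceMerger is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class SourceMerger
{
    // Takes the text of each source file and returns one merged source text:
    // de-duplicated using directives first, then every namespace body.
    public static string Merge(IEnumerable<string> sources)
    {
        var usings = new SortedSet<string>(StringComparer.Ordinal); // removes duplicates
        var declared = new List<string>();                          // namespaces being stripped
        var bodies = new StringBuilder();
        const string keyword = "namespace";

        foreach (string source in sources)
        {
            int ns = source.IndexOf(keyword, StringComparison.Ordinal);
            int open = ns < 0 ? -1 : source.IndexOf('{', ns); // first '{' after "namespace"
            int close = source.LastIndexOf('}');              // last '}' in the file
            if (open < 0 || close < open)
                throw new FormatException("Expected one block-scoped namespace per file.");

            // Collect the using directives that appear above the namespace.
            foreach (string line in source.Substring(0, ns).Split('\n'))
            {
                string t = line.Trim();
                if (t.StartsWith("using ", StringComparison.Ordinal) &&
                    t.EndsWith(";", StringComparison.Ordinal))
                    usings.Add(t);
            }

            declared.Add(source.Substring(ns + keyword.Length, open - ns - keyword.Length).Trim());
            bodies.AppendLine(source.Substring(open + 1, close - open - 1).Trim('\r', '\n'));
        }

        // A using that targets a stripped namespace (or a parent of one) would not compile.
        usings.RemoveWhere(u =>
        {
            string target = u.Substring("using ".Length).TrimEnd(';').Trim();
            return declared.Any(d => d == target ||
                                     d.StartsWith(target + ".", StringComparison.Ordinal));
        });

        return string.Join(Environment.NewLine, usings) + Environment.NewLine + bodies.ToString();
    }
}
```

Because the namespace wrappers are removed, all classes end up in the global namespace, so two classes with the same name in different namespaces would collide and need manual renaming.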
It is very subjective. I see the algorithm as the following in pseudo code:
var usingsLines = new HashSet<string>();
var newFile = new StringBuilder();
foreach (var file in listOfFiles)
{
    var textFromFile = File.ReadAllText(file);
    // GetUsings() and GetBody() are placeholders for helpers you write yourself.
    usingsLines.UnionWith(textFromFile.GetUsings()); // the set drops duplicates
    newFile.AppendLine(textFromFile.GetBody());
}
var merged = string.Join(Environment.NewLine, usingsLines) + Environment.NewLine + newFile;
// As a result it will have something like this
// using usingsfromFirstFile;
// using usingsfromSecondFile;
//
// namespace FirstFileNamespace
// {
// ...
// }
//
// namespace SecondFileNamespace
// {
// ...
// }
But keep in mind that this approach can lead to conflicts if two different namespaces contain classes with the same name, etc. To solve that you need to fix it manually, or get rid of the using directives and use fully qualified names instead.
Also these few links may be useful:
Merge files,
Merge file in Java
So I've been using the same code for about a year now. Normally I find new ways to do old tasks and slowly improve, but I seem to have stagnated with this one. I was curious whether anyone could provide any insight into how I could do this task differently. I'm loading a text file, reading all its lines into a string array, and then looping over those entries to perform an operation on each line.
string[] config = File.ReadAllLines("Config.txt");
foreach (string line in config)
{
DoOperations(line);
}
Eventually I'll just be moving to OpenFileDialog, but that's for some time in the future, and using an OpenFileDialog in a multithreaded console application seems like bad practice.
Since you don't act on the whole file at any point you could read it one line at a time. Given that your file looks like a config it's probably not a massive file, but if you were trying to read a large file in using File.ReadAllLines() you can get into memory issues. Reading one line at a time helps avoid that.
using (StreamReader file = new StreamReader("config.txt"))
{
    string line;
    // ReadLine returns null once the end of the file is reached.
    while ((line = file.ReadLine()) != null)
    {
        DoOperations(line);
    }
}
You could rename config to lines for readability ;)
You could use var
Select? (if DoSomething returns something)
var parsed = File.ReadAllLines("Config.txt").Select(line => Parsed(line));
ForEach?
lines.ToList().ForEach(line => DoSomething(line));
Read line by line with ReadLines?
foreach (var line in File.ReadLines("Config.txt"))
{
(...)
}
The ReadLines and ReadAllLines methods differ as follows: When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings to be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient.
Is it possible to parse only let's say the first half of the file with antlr4?
I am parsing large files and I am using UnbufferedCharStream and UnbufferedTokenStream.
I am not building a parse tree and I am using parse actions instead of visitor/listener patterns. With these I was able to save a significant amount of RAM and improve the parse speed.
However, it still takes around 15 s to parse the whole file. The file is divided into two sections: the first has metadata, the second has the actual data. The majority of the time is spent in the data section, as it has more than 3 million lines to parse, while the metadata section has only around 20,000 lines. Is it possible to parse only the first section, which would improve parse speed significantly? Is it possible to inject EOF manually after the metadata section?
How about dividing the file into two?
How about programmatically extracting only the part you want to parse and writing it to a new tmp.extension file that you then parse? It could look like this:
System.IO.File.WriteAllText(@"C:\Users\Path\tmp.extension", text);
After the parsing you can delete the tmp file, and the original stays as it is.
System.IO.File.Delete(@"C:\Users\Path\tmp.extension");
ANTLR4 creates recursive-descent parsers, with parse functions that can be invoked directly. Assume you have a grammar like this:
grammar t;
start: meta data EOF;
meta: x y z;
data: a b c+;
Your natural entry point would be the start rule (in your case that would be the rule for the entire file). But it's also possible to only invoke rule meta, which in your case could be the header part of the file. If you don't end this rule with EOF, your parser will just consume enough input to parse this particular part of the entire file.
So, I was able to find a solution. I overrode the Emit method of the generated lexer so that it finds the beginning of the second section and manually injects an EOF token, like this:
public override IToken Emit()
{
string tokenText = base.Text;
if (this.metaDataOnly && tokenText == "DATA")
return base.EmitEOF();
return base.Emit();
}
I did check whether any existing question matched mine and didn't see one; if I missed it, my mistake.
I have two text files to compare against each other. One is a temporary log file that is sometimes overwritten; the other is a permanent log, which collects the contents of the temp log by appending any lines that are new since the last check. However, after a point the complete log may become quite large and therefore inefficient to compare against, so I have been thinking about different ways to approach this.
My first idea is to "buffer" the temp log's lines (it will normally be the smaller of the two) into a list and simply loop through the archive log, doing something like:
List<String> bufferedlines = new List<string>();
using (StreamReader ArchiveStream = new StreamReader(ArchivePath))
{
if (bufferedlines.Contains(ArchiveStream.ReadLine()))
{
}
}
Now there are a couple of ways I could proceed from here. I could create yet another list to store the inconsistencies, close the read stream (I'm not sure you can read and write at the same time; if you can, that might make things easier), then open a write stream in append mode and write the list to the file. Alternatively, skipping the buffering of the inconsistencies, I could open a write stream while the files are being compared and write the unmatched lines on the spot.
The other method I could think of, though I don't know whether it can be done, is to buffer neither file and instead compare the streams side by side as they are read, appending the lines on the fly. Something like:
using (StreamReader ArchiveStream = new StreamReader(ArchivePath))
{
using (StreamReader templogStream = new StreamReader(tempPath))
{
if (!(ArchiveStream.ReadAllLines.Contains(TemplogStream.ReadLine())))
{
//write the line to the file
}
}
}
As I said, I'm not sure whether that would work or whether it would be more efficient than the first method, so I figured I'd ask whether anyone has insight into how this might properly be implemented, and whether it is the most efficient way or there is a better method out there.
Effectively what you want here is all of the items from one set that aren't in another set. This is set subtraction, or in LINQ terms, Except. If your data sets were sufficiently small you could simply do this:
var lines = File.ReadLines(TempPath)
.Except(File.ReadLines(ArchivePath))
.ToList();//can't write to the file while reading from it
File.AppendAllLines(ArchivePath, lines);
Of course, this code requires bringing all of the lines of the archive file into memory, because that's just how Except is implemented: it builds a HashSet from the second sequence so that it can efficiently test the items of the first sequence against it.
Presumably the number of lines that need to be added is pretty small, so the fact that they all need to be stored in memory (the ToList call) isn't a problem. If there could potentially be a lot of them, you'd want to write them out to a separate file rather than the archive (possibly concatenating the two files when done, if needed).
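If the archive is the file that keeps growing, one alternative is to hold only the temp log in memory and stream the archive past it. The following is a sketch of that idea rather than a drop-in replacement; LogArchiver and AppendMissingLines are hypothetical names, and like Except it treats the lines as a set (duplicates in the temp log are appended once).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LogArchiver
{
    // Appends to archivePath every distinct line of tempPath that the archive
    // does not contain yet, and returns how many lines were appended.
    // Only the temp log is held in memory; the archive is read line by line.
    public static int AppendMissingLines(string tempPath, string archivePath)
    {
        string[] tempLines = File.ReadAllLines(tempPath);
        var missing = new HashSet<string>(tempLines);

        if (File.Exists(archivePath))
        {
            foreach (string line in File.ReadLines(archivePath))
            {
                missing.Remove(line);          // already archived
                if (missing.Count == 0) break; // nothing left to look for
            }
        } // the reader is closed here, so the file can now be opened for writing

        // Walk the temp log in its original order. Remove returns true only the
        // first time a still-missing line is seen, which also drops duplicates.
        List<string> toAppend = tempLines.Where(line => missing.Remove(line)).ToList();

        File.AppendAllLines(archivePath, toAppend);
        return toAppend.Count;
    }
}
```

This still scans the whole archive on each run in the worst case, but its memory use depends on the size of the temp log only.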
var fileList = Directory.GetFiles("./", "split*.dat");
int fileCount = fileList.Length;
int i = 0;
foreach (string path in fileList)
{
string[] contents = File.ReadAllLines(path); // OutOfMemoryException
Array.Sort(contents);
string newpath = path.Replace("split", "sorted");
File.WriteAllLines(newpath, contents);
File.Delete(path);
contents = null;
GC.Collect();
SortChunksProgressChanged(this, (double)i / fileCount);
i++;
}
For a file that consists of ~20-30 big lines (every line ~20 MB), I get an OutOfMemoryException when I call the ReadAllLines method. Why is this exception thrown, and how do I fix it?
P.S. I use Mono on MacOS
You should always be very careful about performing operations with potentially unbounded results, in your case reading a file. As you mention, the file size and/or line length is unbounded.
The answer lies in reading 'enough' of a line to sort then skipping characters until the next line and reading the next 'enough'. You probably want to aim to create a line index lookup such that when you reach an ambiguous line sorting order you can go back to get more data from the line (Seek to file position). When you go back you only need to read the next sortable chunk to disambiguate the conflicting lines.
You may need to think about the file encoding, don't go straight to bytes unless you know it is one byte per char.
The built in sort is not as fast as you'd like.
Side Note:
If you call GC.* you've probably done it wrong
setting contents = null does not help you
If you are using a foreach and maintaining the index then you may be better with a for(int i...) for readability
Okay, let me give you a hint to help you with your homework. Loading the complete file into memory will, as you know, not work, because that is given as a precondition of the assignment. You need to find a way to lazily load the data from disk as you go and throw away as much data as soon as possible. Because single lines could be too big, you will have to do this one char at a time.
Try creating a class that represents an abstraction over a line, for instance by wrapping the starting index and ending index of that line. When you let this class implement IComparable<T> it allows you to sort that line with other lines. Again, the trick is to be able to read characters from the file one at a time. You will need to work with Streams (File.Open) directly.
When you do this, you will be able to write your application code like this:
List<FileLine> lines = GetLines("fileToSort.dat");
lines.Sort();
foreach (var line in lines)
{
line.AppendToFile("sortedFile.dat");
}
Your task will be to implement GetLines(string path) and create the FileLine class.
Note that I assume that the actual number of lines will be small enough that the List<FileLine> will fit into memory (which means an approximate maximum of 40,000,000 lines). If the number of lines can be higher, you would need an even more flexible approach, but since you are talking about 20 to 30 lines, this should not be a problem.
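For illustration only, here is one way such a FileLine class might look. It is a sketch under several assumptions that the answer does not state: lines end with '\n', ordering is by raw byte value (which matches ordinal order for ASCII and UTF-8 text), and GetLines is a static method on the class rather than a free function. Reopening the file for every comparison is slow but keeps the example short.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical FileLine: remembers only where a line lives in the file.
public sealed class FileLine : IComparable<FileLine>
{
    private readonly string _path;
    public long Start { get; }
    public long Length { get; }

    public FileLine(string path, long start, long length)
    {
        _path = path; Start = start; Length = length;
    }

    // Scans the file once, byte by byte, recording the offsets of every line.
    public static List<FileLine> GetLines(string path)
    {
        var lines = new List<FileLine>();
        using (var stream = new BufferedStream(File.OpenRead(path)))
        {
            long start = 0, pos = 0;
            int b;
            while ((b = stream.ReadByte()) != -1)
            {
                pos++;
                if (b == '\n')
                {
                    lines.Add(new FileLine(path, start, pos - 1 - start));
                    start = pos;
                }
            }
            if (pos > start) lines.Add(new FileLine(path, start, pos - start)); // last line without '\n'
        }
        return lines;
    }

    // Compares two lines byte by byte straight from disk, stopping at the first difference.
    public int CompareTo(FileLine other)
    {
        if (other is null) return 1;
        using (var a = OpenAt(_path, Start))
        using (var b = OpenAt(other._path, other.Start))
        {
            long common = Math.Min(Length, other.Length);
            for (long i = 0; i < common; i++)
            {
                int x = a.ReadByte(), y = b.ReadByte();
                if (x != y) return x.CompareTo(y);
            }
        }
        return Length.CompareTo(other.Length); // a prefix sorts before the longer line
    }

    // Copies this line to the end of the destination file in small chunks.
    public void AppendToFile(string destination)
    {
        using (var source = OpenAt(_path, Start))
        using (var target = new FileStream(destination, FileMode.Append, FileAccess.Write))
        {
            var buffer = new byte[81920];
            long remaining = Length;
            while (remaining > 0)
            {
                int read = source.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                if (read == 0) break;
                target.Write(buffer, 0, read);
                remaining -= read;
            }
            target.WriteByte((byte)'\n');
        }
    }

    private static Stream OpenAt(string path, long offset)
    {
        var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
        fs.Seek(offset, SeekOrigin.Begin);
        return new BufferedStream(fs);
    }
}
```

With this sketch the application code shown above would call FileLine.GetLines("fileToSort.dat") instead of GetLines("fileToSort.dat"); at no point is more than one copy buffer of line content held in memory.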
Basically your approach is bull. You are violating a constraint of the homework you were given, and this constraint has been put there to make you think more.
As you said:
I must implement external sort and show my teacher that it works for files bigger than my RAM
OK, so how do you think you will ever read the file in? ;) The constraint is there on purpose. ReadAllLines does NOT implement incremental external sort. As a result, it blows up.
I need a bit of help. I have two sources of information, and the information is exported to two different CSV files by different programs. They are supposed to contain the same information, but that is what needs to be checked.
Therefore what I would like to do is as follows:
Take the information from the two files.
Compare
Output any differences and which file the difference was in. (e.g File A Contained this, but File B did not and vice versa).
The files are 200,000-odd rows each, so this will need to be as efficient as possible.
I tried doing this with Excel, but it proved too complicated, and I'm really struggling to find a way to do it programmatically.
Assuming that the files are really supposed to be identical, right down to text qualifiers, ordering of rows, and number of rows contained in each file, the simplest approach may be to simply iterate through both files together and compare each line.
using (StreamReader f1 = new StreamReader(path1))
using (StreamReader f2 = new StreamReader(path2)) {
    var differences = new List<string>();
    int lineNumber = 0;
    // Walk both files in lockstep and record every line that differs.
    while (!f1.EndOfStream) {
        if (f2.EndOfStream) {
            differences.Add("Differing number of lines - f2 has fewer.");
            break;
        }
        lineNumber++;
        var line1 = f1.ReadLine();
        var line2 = f2.ReadLine();
        if (line1 != line2) {
            differences.Add(string.Format("Line {0} differs. File 1: {1}, File 2: {2}", lineNumber, line1, line2));
        }
    }
    if (!f2.EndOfStream) {
        differences.Add("Differing number of lines - f1 has fewer.");
    }
}
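If the rows are not guaranteed to be in the same order, a set-based comparison may fit better. The following sketch relaxes the assumption above: it compares whole lines as text, ignores ordering and duplicate rows, and reports what each file has that the other lacks. CsvDiff and its member names are hypothetical, and it assumes both files fit in memory, which should hold for a couple of hundred thousand rows.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class CsvDiff
{
    // Returns the rows found only in file A and the rows found only in file B.
    // Rows are compared as exact lines of text, so differences in quoting or
    // whitespace count as differences.
    public static (HashSet<string> OnlyInA, HashSet<string> OnlyInB) Compare(
        string pathA, string pathB)
    {
        var rowsA = new HashSet<string>(File.ReadLines(pathA));
        var rowsB = new HashSet<string>(File.ReadLines(pathB));

        var onlyInA = new HashSet<string>(rowsA);
        onlyInA.ExceptWith(rowsB); // A minus B

        var onlyInB = new HashSet<string>(rowsB);
        onlyInB.ExceptWith(rowsA); // B minus A

        return (onlyInA, onlyInB);
    }
}
```

Each returned set can then be written out with a label such as "File A contained this, but File B did not".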
Depending on your answers to the comments on your question, if it doesn't really need to be done in code, you could do worse than download a compare tool, which is likely to be more sophisticated.
(Winmerge for example)
OK, for anyone else that googles this and finds this. Here is what my answer was.
I exported the details to a CSV and ordered them numerically when they were exported for ease of use. Once they were exported as two CSV files, I then used a program called Beyond Compare which can be found here. This allows the files to be compared.
At first I used Beyond Compare manually to check that what I was exporting was correct, but Beyond Compare can also be driven from the command line. That way everything is done programmatically, and all that remains is for a user to view the results in Beyond Compare. You may be able to export the results to another CSV; I haven't looked, as the Beyond Compare GUI is very nice and useful, so it is easier to use that.