How to extract an unknown amount of text from a file


This ties back to an earlier question of mine about a regex to search for a method containing a particular string. Someone suggested I use a Microsoft tool called Roslyn, but it isn't available for VS2010 now that VS2012 has come out.
So I'm writing this small utility to keep a list of every file in my solution that contains a particular method declaration (something like 3k of the 25k files overload this method). Then I simply want to filter that list of files to only ones that contain += inside the body of the method.
static void DirSearch(string dir)
{
    string[] files = Directory.GetFiles(dir, "*.*", SearchOption.AllDirectories);
    foreach (var file in files)
    {
        var contents = File.ReadAllText(file);
        if (contents.Contains("void DetachEvents()"))
        {
            //IF DetachEvents CONTAINS += THEN...
            WriteToFile(file);
        }
    }
}
This method iterates over all the folders and writes the file name to a text file if the file contains the key method. But I have no idea how to extract just what's in the method body: since it's overloaded, all 3K instances of the method are different.
Would the best approach be to get the index of the method name, then the index of each { and } until I encounter the next access modifier (signifying I've reached the end of DetachEvents)? Then I could just search between indexOfMethod and indexOfEndMethod for +=.
But it sounds really sloppy, I was hoping someone might have a better idea?

Do you have to do this in code? Is this a one-time utility to identify the problem methods? Why not use something like Notepad++ and its Find in Files capability? You can filter your find pretty easily and even apply a regex (I think). From there you can copy the results, which include the file name (i.e. someclassfile.cs), and get a list from there.

I wrote this really sloppy WinForms app that lets the user type in the folder of the code base, the method name, and the flagrant text they're looking for. It loops over every file in the directory and calls this method on a string that contains all the text of the file. The method returns true if the user-entered flag text is present, and the caller then adds the file it's on to a list. Anyway, here's the main code:
// Requires: using System.Linq; (for Min)
private bool ContainsFlag(string contents)
{
    int indexOfMethodDec = contents.IndexOf(_method);
    if (indexOfMethodDec == -1)
        return false; // Method is not in this file; the IndexOf calls below would throw on a start index of -1
    int indexOfNextPublicMethod = contents.IndexOf("public", indexOfMethodDec);
    if (indexOfNextPublicMethod == -1)
        indexOfNextPublicMethod = int.MaxValue;
    int indexOfNextPrivateMethod = contents.IndexOf("private", indexOfMethodDec);
    if (indexOfNextPrivateMethod == -1)
        indexOfNextPrivateMethod = int.MaxValue;
    int indexOfNextProtectedMethod = contents.IndexOf("protected", indexOfMethodDec);
    if (indexOfNextProtectedMethod == -1)
        indexOfNextProtectedMethod = int.MaxValue;
    int[] indices = new int[3]{indexOfNextPrivateMethod,
                               indexOfNextProtectedMethod,
                               indexOfNextPublicMethod};
    int closestToMethod = indices.Min();
    if (closestToMethod.Equals(Int32.MaxValue))
        return false; //This should probably do something different.. This condition is true if the method you're reading is the last method in the class, basically
    if (closestToMethod - indexOfMethodDec < 0)
        return false;
    string methodBody = contents.Substring(indexOfMethodDec, closestToMethod - indexOfMethodDec);
    return methodBody.Contains(_flag);
}
Plenty of room for improvement, this is mostly just a proof-of-concept thing that'll get used maybe twice per year internally. But for my purposes it worked. Should be a good starting-point for something more sophisticated if anyone needs it.
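The brace-counting idea from the question can be sketched as follows. This is a minimal sketch, not a parser: MethodBodyScanner and ExtractBody are hypothetical names, and it assumes that braces never appear inside string literals or comments within the method and that the declaration is followed by a body (not an abstract or interface declaration).

```csharp
using System;

static class MethodBodyScanner
{
    // Returns the text between the outermost braces of the first method whose
    // declaration contains `signature`, or null if it cannot be found.
    // Assumption: no braces inside string literals or comments.
    public static string ExtractBody(string contents, string signature)
    {
        int decl = contents.IndexOf(signature, StringComparison.Ordinal);
        if (decl < 0) return null;

        // The first '{' after the signature opens the method body.
        int open = contents.IndexOf('{', decl + signature.Length);
        if (open < 0) return null;

        int depth = 0;
        for (int i = open; i < contents.Length; i++)
        {
            if (contents[i] == '{')
            {
                depth++;
            }
            else if (contents[i] == '}' && --depth == 0)
            {
                // Matching close brace found: return what lies between.
                return contents.Substring(open + 1, i - open - 1);
            }
        }
        return null; // unbalanced braces
    }
}
```

A file would then be flagged when ExtractBody(contents, "void DetachEvents()") returns a non-null string that contains "+=". Unlike searching for the next access modifier, this also works when the method is the last one in the class.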

Related

Get files from directory and subdirectories quickly order by latest creation date

I am looking for a method that takes a file extension and a directory and returns all the files within that directory and its subdirectories, ordered by creation date with the latest files first.
So far I have identified the following method, which is meant to be fast. Is there a better way of doing this? I also need it to return FileInfo rather than a string, ordered as described above.
public static IEnumerable<string> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    Queue<string> pending = new Queue<string>();
    pending.Enqueue(rootFolderPath);
    string[] tmp;
    while (pending.Count > 0)
    {
        rootFolderPath = pending.Dequeue();
        tmp = Directory.GetFiles(rootFolderPath, fileSearchPattern);
        for (int i = 0; i < tmp.Length; i++)
        {
            yield return tmp[i];
        }
        tmp = Directory.GetDirectories(rootFolderPath);
        for (int i = 0; i < tmp.Length; i++)
        {
            pending.Enqueue(tmp[i]);
        }
    }
}

When I researched this problem space, I found there isn't a fast way to do this. The reason is that no matter what approach you take, you end up having to go to the operating system for the list of files in a directory, and the file system doesn't cache or index the way a search engine would. So you end up needing to recrawl the file system yourself.
Once you have the raw information, however, you can index it yourself.

The below will work for your purposes. Use Directory.EnumerateFiles(...) so that the file list uses less memory up front: it only goes looking for the next element when you ask for it instead of loading the entire collection into memory at the start. (The sort still has to read every entry before it can return the first one.) OrderByDescending puts the latest files first:
Directory.EnumerateFiles(rootFolderPath, fileSearchPattern, System.IO.SearchOption.AllDirectories).OrderByDescending(file => new FileInfo(file).CreationTime)
One additional consideration: since you are doing a fairly blind search through the file system, if an exception is thrown while enumerating a file (for example, access denied), it invalidates the enumerator, causing it to exit without finishing. I have posted a solution to that problem here

Directory.GetFiles does have an option to search recursively.
The following should work, although I haven't tried it.
IEnumerable<FileInfo> GetFileList(string directory, string extension)
{
    return Directory.GetFiles(directory, "*" + extension, SearchOption.AllDirectories)
        .Select(f => new FileInfo(f))
        .OrderByDescending(f => f.CreationTime);
}
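A variation that avoids creating a FileInfo from each path string is to start from DirectoryInfo, whose EnumerateFiles overload (available from .NET 4.0) already yields FileInfo objects. This is only a sketch: FileLister and GetFilesNewestFirst are hypothetical names, and it does not handle directories that throw on access.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class FileLister
{
    // Returns every file under `root` (recursively) matching `pattern`,
    // newest creation time first.
    public static IEnumerable<FileInfo> GetFilesNewestFirst(string root, string pattern)
    {
        return new DirectoryInfo(root)
            .EnumerateFiles(pattern, SearchOption.AllDirectories)
            .OrderByDescending(f => f.CreationTimeUtc);
    }
}
```

For example, FileLister.GetFilesNewestFirst(rootFolderPath, "*.log") would return the .log files with the most recently created one first.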

C# Remove file extension from string list

My program lists file names (including their extensions) from a directory in a listbox. It then has a sorting function which sorts the list of strings into alphabetical order.
Lastly, it has a binary search function that lets the user input any string, which the program then compares against the list, displaying the matched results in a listbox.
Now, all these functions work perfectly; however, I can't seem to remove the extension from a file name after a search.
For example, in the scanning and sorting part it lists the file names as: filename.mp3
What I want it to do when the search button is clicked is to remove the file extension and display just the filename.
private void buttonSearch_Click(object sender, RoutedEventArgs e)
{
    listBox1.Items.Clear();
    string searchString = textBoxSearchPath.Text;
    int index = BinarySearch(list1, 0, list1.Count, searchString);
    for (int n = index; n < list1.Count; n++)
    {
        //Removes file extension from last decimal point ''not working''
        int i = list1[n].LastIndexOf(".");
        if (i > 0)
            list1[n].Substring(0, i);
        // Adds items to list
        if (list1[n].IndexOf(searchString, StringComparison.OrdinalIgnoreCase) != 0) break;
        listBox1.Items.Add(list1[n]);
    }
    MessageBox.Show("Done");
}

C# is so easy that if something takes more than 2 minutes, there probably is a method for it in the Framework.

The Substring method returns a new fresh copy of the string, copied from the source one. If you want to "cut the extension off", then you must fetch what Substring returns and store it somewhere, i.e.:
int i = list1[n].LastIndexOf(".");
if (i > 0)
list1[n] = list1[n].Substring(0, i);
However, this is quite an odd way to remove an extension.
Firstly, using Substring(0, idx) is odd, as there's Remove(idx), which does exactly that:
int i = list1[n].LastIndexOf(".");
if (i > 0)
    list1[n] = list1[n].Remove(i);
But, secondly, there's an even better way of doing it: the System.IO.Path class provides a set of well-written static methods that, for example, remove the extension (edit: this is what L-Three suggested in comments), with full handling of dots etc.:
var str = System.IO.Path.GetFileNameWithoutExtension("myfile.txt"); // == "myfile"
It still returns a copy and you still have to store the result somewhere!
list1[n] = Path.GetFileNameWithoutExtension( list1[n] );
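To make the "you still have to store the result" point concrete, here is a small sketch that builds a new list of names instead of trying to modify the strings in place. ExtensionStripper and StripExtensions are hypothetical names, not part of the question's code.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ExtensionStripper
{
    // Strings are immutable, so the stripped names are collected into a new
    // list rather than "edited" inside the old one.
    public static List<string> StripExtensions(IEnumerable<string> fileNames)
    {
        return fileNames
            .Select(name => Path.GetFileNameWithoutExtension(name))
            .ToList();
    }
}
```

Only the last extension is removed, so "my.song.mp3" becomes "my.song", and a name without any dot is returned unchanged.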

Try the code below; it should help.
Description: file name without extension
listBox1.Items.Add(Path.GetFileNameWithoutExtension(list1[n]));

Use Path.GetFileNameWithoutExtension

Use Path.GetFileNameWithoutExtension method. Quite easy I guess.
http://msdn.microsoft.com/en-us/library/system.io.path.getfilenamewithoutextension.aspx

Not sure how you've implemented your directory searching, but you can leverage LINQ in these situations for clean, easy-to-read code:
var files = Directory.EnumerateFiles(@"\\PathToFiles")
    .Select(f => Path.GetFileNameWithoutExtension(f));
If you're using .NET 4.0, EnumerateFiles seems to be a superior choice over GetFiles. However, it also sounds like you want both the full file path and the file name without extension. Here's how you could create a Dictionary so that you avoid looping through the collection twice:
var files = Directory.EnumerateFiles(@"\\PathToFiles")
    .ToDictionary(f => f, n => Path.GetFileNameWithoutExtension(n));

A way to do this if you don't have a file path, just a file name:
string filePath = (@"D:/" + fileName);
string withoutExtension = Path.GetFileNameWithoutExtension(filePath);

An elegant way of renaming a file if it already exists when archiving

I'd like to write an archiving function that takes all files in a folder and moves them into an archive subfolder using the current date. The process could be run several times in a day and therefore needs to handle duplicates.
The rules for the archiving are below:
If a file name already exists, I'd like to add an underscore "_" and a number (starting from 1).
If the file name has already been modified, then I'd like to increment the number.
I can do this using lots of File.Exists and LastIndexOf calls, but is there a more elegant way? Maybe with LINQ?
EDIT:
This is the code I have already. It's a bit rough and ready but it gives an idea of what I want to do.
/// <summary>
/// Move the local file into the archive location.
/// If the file already exists then add a counter to the file name or increment the existing counter
/// </summary>
/// <param name="LocalFilePath">The full path of the file to be archived</param>
/// <param name="ArchiveFilePath">The proposed full path of the file once it's been archived</param>
private void ArchiveFile(string LocalFilePath, string ArchiveFilePath)
{
    // Ensure the file name doesn't already exist in the location we want to move it to
    if (File.Exists(ArchiveFilePath))
    {
        // Does the archive file have a number at the end?
        // (Assumes an underscore in the name is always followed by a counter.)
        string[] archiveSplit = Path.GetFileNameWithoutExtension(ArchiveFilePath).Split('_');
        if (archiveSplit.Length == 1)
        {
            // No number detected so append the number 1 to the filename.
            // Path.GetExtension already includes the leading dot.
            string newArcFileName = string.Format("{0}_1{1}",
                Path.GetFileNameWithoutExtension(ArchiveFilePath), Path.GetExtension(ArchiveFilePath));
            // Create the new full path
            string newArcPath = Path.Combine(Path.GetDirectoryName(ArchiveFilePath), newArcFileName);
            // Recursively call ArchiveFile to ensure the new file name doesn't exist before moving
            ArchiveFile(LocalFilePath, newArcPath);
        }
        else
        {
            // Get the number from the last element of the split and increment it
            int lastNum = Convert.ToInt32(archiveSplit.Last()) + 1;
            // Rebuild the filename from every element except the old number
            string newArcFileName = archiveSplit[0];
            for (int i = 1; i < archiveSplit.Length - 1; i++)
            {
                newArcFileName += "_" + archiveSplit[i];
            }
            // Finally add the number and extension (the extension includes its dot)
            newArcFileName = string.Format("{0}_{1}{2}", newArcFileName, lastNum, Path.GetExtension(ArchiveFilePath));
            // Create the new full path
            string newArcPath = Path.Combine(Path.GetDirectoryName(ArchiveFilePath), newArcFileName);
            // Recursively call ArchiveFile to ensure the new file name doesn't exist before moving
            ArchiveFile(LocalFilePath, newArcPath);
        }
    }
    else
    {
        // There is no file with a matching name
        File.Move(LocalFilePath, ArchiveFilePath);
    }
}

The Directory class has a method to receive a list of all files within. That method allows you to specify a filter string, like so:
Directory.GetFiles(directoryPath, filterString);
If you already know your filename prefix, you can use that filter string to get all the files within that pattern:
filterString = string.Format("{0}_*.{1}", defaultFileNameWithoutExtension, defaultExtension);
You can then simply select the one with the highest suffix, extract the suffix digits, increase it and build your new (unused) file name.
Disclaimer: This was written by heart, feel free to edit in case of errors :)

File.Exists would still need to be called even if you use LINQ, that will not change.
I suggest keeping things simple - looping with File.Exists and LastIndexOf is a suitable solution, unless performance is imperative.
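A minimal sketch of that simple loop, assuming the counter always starts from the original (unnumbered) archive name. Archiver and NextFreePath are hypothetical names. Because the loop always counts up from 1, there is no need to parse an existing suffix.

```csharp
using System.IO;

static class Archiver
{
    // Returns desiredPath if nothing exists there yet, otherwise the first
    // "name_N.ext" (N = 1, 2, ...) in the same folder that is still free.
    public static string NextFreePath(string desiredPath)
    {
        if (!File.Exists(desiredPath)) return desiredPath;

        string dir = Path.GetDirectoryName(desiredPath) ?? "";
        string name = Path.GetFileNameWithoutExtension(desiredPath);
        string ext = Path.GetExtension(desiredPath); // includes the leading dot

        for (int n = 1; ; n++)
        {
            string candidate = Path.Combine(dir, name + "_" + n + ext);
            if (!File.Exists(candidate)) return candidate;
        }
    }
}
```

The move then becomes File.Move(LocalFilePath, Archiver.NextFreePath(ArchiveFilePath)). Note that another process could still create the file between the check and the move, so this is only suitable when a single archiver runs at a time.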

Maybe you should use the Path API and use EndsWith instead of LastIndexOf :).
You could also keep a file that stores the tree of files (take a look at rsync).
Do you really want to make several duplicates of the same file even if it hasn't changed? Or are you looking for an updated modified date/time?
http://msdn.microsoft.com/en-us/library/system.io.path.aspx : Path

How to check if I have at least one *.bak file in my folder?

How can I check whether I have at least one *.bak file in my folder?

You can list all files in a particular directory using Directory.GetFiles(). The second parameter is a pattern to search for (which includes wildcards). Something like this should do it:
var hasBak = Directory.GetFiles(yourdir, "*.bak").Length > 0;

Directory.GetFiles is correct, but not the best solution if you are using .NET 4.0, because we have:
bool exist = Directory.EnumerateFiles(@"C:\mydir", "*.bak").Any();
Directory.GetFiles returns all the matched files, and you can check the Length property. But when we invoke Any on Directory.EnumerateFiles, we essentially get its enumerator and call MoveNext, so the method returns as soon as it finds any item (this way we never need to loop through all the files). I checked their implementation, and tested with:
Directory.EnumerateFiles(@"C:\Windows", "*.log").Any();
GetFiles took about 4x as long as EnumerateFiles (running each 10000 times, measured with Stopwatch).
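Wrapped up as a helper, the early-exit check might look like the sketch below. BackupCheck and HasAnyFile are hypothetical names, and the method assumes the directory exists (EnumerateFiles throws DirectoryNotFoundException otherwise).

```csharp
using System.IO;
using System.Linq;

static class BackupCheck
{
    // True as soon as one file in `directory` matches `pattern`;
    // the remaining files are never enumerated.
    public static bool HasAnyFile(string directory, string pattern)
    {
        return Directory.EnumerateFiles(directory, pattern).Any();
    }
}
```

A call such as BackupCheck.HasAnyFile(yourdir, "*.bak") then replaces the Length > 0 check on the full array.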

Well, you can use Directory.GetFiles(directory, "*.bak") to get the list of bak files, and then just check whether the length is 0 or not.
if (Directory.GetFiles(directory, "*.bak").Length == 0)
{
    // Complain to the user or whatever you want to do
}

public bool IsAtleastOneFilePresent()
{
    string[] filePaths = Directory.GetFiles(@"c:\MyDir\", "*.bak");
    return filePaths.Length > 0;
}

string[] files = Directory.GetFiles(@"c:\SomeDirectory\", "*.bak");
and ensure that files.Length > 0

Use Directory.GetFiles
Directory.GetFiles([dir], "*.bak")
http://msdn.microsoft.com/en-us/library/wz42302f.aspx

Importing data files using generic class definitions

I am trying to import a file with multiple record definitions in it. Each one can also have a header record, so I thought I would define a definition interface like so.
public interface IRecordDefinition<T>
{
    bool Matches(string row);
    T MapRow(string row);
    bool AreRecordsNested { get; }
    GenericLoadClass ToGenericLoad(T input);
}
I then created a concrete implementation for a class.
public class TestDefinition : IRecordDefinition<Test>
{
    public bool Matches(string row)
    {
        return row.Split('\t')[0] == "1";
    }
    // Named MapRow (taking the raw row) so that it actually implements the interface member
    public Test MapRow(string row)
    {
        string[] columns = row.Split('\t');
        return new Test {val = columns[0].parseDate("ddmmYYYY")};
    }
    public bool AreRecordsNested
    {
        get { return true; }
    }
    public GenericLoadClass ToGenericLoad(Test input)
    {
        return new GenericLoadClass {Value = input.val};
    }
}
However, for each file definition I need to store a list of the record definitions so I can then loop through each line in the file and process it accordingly.
Firstly, am I on the right track, or is there a better way to do it?

I would split this process into two pieces.
First, a specific process to split the file with multiple types into multiple files. If the files are fixed width, I have had a lot of luck with regular expressions. For example, assume the following is a text file with three different record types.
TE20110223 A 1
RE20110223 BB 2
CE20110223 CCC 3
You can see there is a pattern here; hopefully the person who decided to put all the record types in the same file gave you a way to identify those types. In the case above you would define three regular expressions.
// Requires: using System.IO; using System.Text; using System.Text.RegularExpressions;
string pattern1 = @"^TE(?<DATE>[0-9]{8})(?<NEXT1>.{2})(?<NEXT2>.{2})";
string pattern2 = @"^RE(?<DATE>[0-9]{8})(?<NEXT1>.{3})(?<NEXT2>.{2})";
string pattern3 = @"^CE(?<DATE>[0-9]{8})(?<NEXT1>.{4})(?<NEXT2>.{2})";
Regex Regex1 = new Regex(pattern1);
Regex Regex2 = new Regex(pattern2);
Regex Regex3 = new Regex(pattern3);
StringBuilder FirstStringBuilder = new StringBuilder();
StringBuilder SecondStringBuilder = new StringBuilder();
StringBuilder ThirdStringBuilder = new StringBuilder();
string Line = "";
Match LineMatch;
FileInfo myFile = new FileInfo("yourFile.txt");
using (StreamReader s = new StreamReader(myFile.FullName))
{
    while (s.Peek() != -1)
    {
        Line = s.ReadLine();
        LineMatch = Regex1.Match(Line);
        if (LineMatch.Success)
        {
            //Write this line to a new file
        }
        LineMatch = Regex2.Match(Line);
        if (LineMatch.Success)
        {
            //Write this line to a new file
        }
        LineMatch = Regex3.Match(Line);
        if (LineMatch.Success)
        {
            //Write this line to a new file
        }
    }
}
Next, take the split files and run them through a generic process, that you most likely already have, to import them. This works well because when the process inevitably fails, you can narrow it to the single record type that is failing and not impact all the record types. Archive the main text file along with the split files and your life will be much easier as well.
Dealing with these kinds of transmitted files is hard, because someone else controls them and you never know when they are going to change. Logging the original file as well as a receipt of the import is very important and shouldn't be overlooked either. You can make that as simple or as complex as you want, but I tend to write a receipt to a DB and copy the primary key from that table into a foreign key in the table I have imported the data into, then never change that data. I like to keep an untouched copy of the import on the file system as well as on the DB server, because there are inevitably conversion/transformation issues that you will need to track down.
Hope this helps, because this is not a trivial task. I think you are on the right track, but instead of processing/importing each line separately, write them to a separate file. I am assuming this is financial data, which is one of the reasons I think provability at every step is important.
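The splitting step could also be written with the patterns held in a dictionary, so that adding a record type means adding one entry. This is a sketch under assumptions: RecordSplitter and Split are hypothetical names, the bucket keys simply reuse the TE/RE/CE prefixes from the sample, and the lines are grouped in memory rather than written straight to files.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RecordSplitter
{
    // One pattern per record type, keyed by the prefix used in the sample data.
    private static readonly Dictionary<string, Regex> Patterns = new Dictionary<string, Regex>
    {
        { "TE", new Regex(@"^TE(?<DATE>[0-9]{8})(?<NEXT1>.{2})(?<NEXT2>.{2})") },
        { "RE", new Regex(@"^RE(?<DATE>[0-9]{8})(?<NEXT1>.{3})(?<NEXT2>.{2})") },
        { "CE", new Regex(@"^CE(?<DATE>[0-9]{8})(?<NEXT1>.{4})(?<NEXT2>.{2})") },
    };

    // Groups each line under the first record type whose pattern matches it.
    // Lines that match no pattern are skipped.
    public static Dictionary<string, List<string>> Split(IEnumerable<string> lines)
    {
        var buckets = new Dictionary<string, List<string>>();
        foreach (string line in lines)
        {
            foreach (var pair in Patterns)
            {
                if (!pair.Value.IsMatch(line)) continue;

                List<string> bucket;
                if (!buckets.TryGetValue(pair.Key, out bucket))
                {
                    bucket = new List<string>();
                    buckets[pair.Key] = bucket;
                }
                bucket.Add(line);
                break; // a line belongs to at most one record type
            }
        }
        return buckets;
    }
}
```

Each bucket can then be written out with File.WriteAllLines, giving one file per record type to feed into the existing import process.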

I think the FileHelpers library solves a number of your problems:
Strong types
Delimited
Fixed-width
Record-by-Record operations
I'm sure you could consolidate this into a type hierarchy that could tie in custom binary formats as well.

Have you looked at something using Linq? This is a quick example of Linq to Text and Linq to Csv.
I think it would be much simpler to use "yield return" and IEnumerable to get what you want working. This way you could probably get away with only having 1 method on your interface.
