Keep lines formed with specific characters - c#

I have a text file with several lines and a list of approved characters that can be used. If there are any characters in a line that are not on the approved list, the entire line needs to be deleted.
How can I go about completing this? C# would be the ideal, but Python, PowerShell or JS would be helpful as well.
Example approved characters: abcdefg
Valid: abc
Invalid: abc1
For my program I want the following list of approved characters:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890#^,;.
After sorting the contents I want it to write them back to the file (without the invalid lines).

Here's a program that filters out all lines that contain invalid characters where args[0] is the input file and args[1] is the output file.
using System.IO;
using System.Linq;
using System.Threading.Tasks;
class Program
{
    public static async Task Main(string[] args)
    {
        const string AllowedChars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890#^,;.";
        // ReadLines yields the file line by line; ReadAllText would return one big string
        var lines = File.ReadLines(args[0]);
        using StreamWriter outfile = new (args[1]);
        foreach (string line in lines)
            if (line.All(x => AllowedChars.Contains(x)))
                await outfile.WriteLineAsync(line);
    }
}

You can try using Linq in order to query the file:
using System.IO;
using System.Linq;
...
// HashSet<T> is more efficient than List<T> for Contains: O(1) vs. O(N)
HashSet<char> allowed = new HashSet<char>(
"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890#^,;."
);
string fileName = @"c:\MyFile.txt";
var clearedLines = File
.ReadLines(fileName)
.Where(line => line.All(letter => allowed.Contains(letter)))
.ToArray(); // Since we have to write back, we have to materialize the data
File.WriteAllLines(fileName, clearedLines);
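For files too large to materialize with ToArray(), one possible variation is to stream the kept lines into a temporary file and then swap it in. This is only a sketch: the LineFilter class, its method names and the ".tmp" suffix are illustrative assumptions, not part of either answer above. Note that an empty line contains no disallowed characters, so it is kept.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LineFilter
{
    // Keeps only the lines made up entirely of characters from 'allowed'.
    public static IEnumerable<string> KeepAllowed(IEnumerable<string> lines, string allowed)
    {
        var set = new HashSet<char>(allowed);
        return lines.Where(line => line.All(c => set.Contains(c)));
    }

    // Rewrites 'path' by streaming through a temporary file, so the
    // whole file never has to be held in memory at once.
    public static void FilterFile(string path, string allowed)
    {
        string tempPath = path + ".tmp"; // hypothetical temp name, assumed not to exist yet
        File.WriteAllLines(tempPath, KeepAllowed(File.ReadLines(path), allowed));
        File.Delete(path);
        File.Move(tempPath, path);
    }
}
```

Usage would be something like LineFilter.FilterFile(fileName, "abcdefg"). The trade-off is extra disk space for the temporary copy in exchange for constant memory use.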


How do I get all files from a directory with a variable extension of specified length?

I have a huge directory, including subdirectories, that I need to retrieve files from.
The folders contain various files, but I am only interested in specific proprietary files whose extension is exactly 7 digits long.
For example, I have a folder that contains the following files:
abc.txt
def.txt
GIWFJ1XA.0201000
GIWFJ1UC.0501000
NOOBO0XA.0100100
summary.pdf
someinfo.zip
T7F4JUXA.0300600
vxy98796.csv
YJHLPLBO.0302300
YJHLPLUC.0302800
I have tried the following:
var fileList = Directory.GetFiles(someDir, "*.???????", SearchOption.AllDirectories);
and also
string searchSting = string.Empty;
for (int j = 0; j < 9999999; j++)
{
searchSting += string.Format(", *.{0} ", j.ToString("0000000"));
}
var fileList2 = Directory.GetFiles(someDir, searchSting, SearchOption.AllDirectories);
which, obviously, fails because the string is too long.
I want to only return the files with the specified length of the extension, in this case, 7 digits to avoid having to loop over the thousands I would have to process.
I have considered creating a variable string for the search criteria that would contain all 99,999,999 possible digits but d
How can I accomplish this?
I don't believe there's a way you can do this without looping through the files in the directory and its subfolders. The search pattern for GetFiles doesn't support regular expressions, so we can't really use something like [\d]{7} as a filter. I would suggest using Directory.EnumerateFiles and then return the files that match your criteria.
You can use this to enumerate the files:
private static IEnumerable<string> GetProprietaryFiles(string topDirectory)
{
Func<string, bool> filter = f =>
{
string extension = Path.GetExtension(f);
// is 8 characters long including the .
// all remaining characters are digits
return extension.Length == 8 && extension.Skip(1).All(char.IsDigit);
};
// EnumerateFiles allows us to step through the files without
// loading all of the filenames into memory at once.
IEnumerable<string> matchingFiles =
Directory.EnumerateFiles(topDirectory, "*", SearchOption.AllDirectories)
.Where(filter);
// Return each file as the enumerable is iterated
foreach (var file in matchingFiles)
{
yield return file;
}
}
Path.GetExtension includes the . so we check that the number of characters including the . is 8, and that all remaining characters are digits.
Usage:
List<string> fileList = GetProprietaryFiles(someDir).ToList();
I would just grab the list of files in the directory, and then check if the substring length after the '.' is equal to 7. (* As long as you know no other files would have that length extension)
EDITED to use Path instead:
Directory.GetFiles(@"C:\temp").Where(
fileName => Path.GetExtension(fileName).Length == 8
).ToList();
OLD:
Directory.GetFiles(someDir).Where(
fileName => fileName.Substring(fileName.LastIndexOf('.') + 1).Length == 7
).ToList();
Treat the files list below as a stand-in for the result of Directory.GetFiles().
using System;
using System.Collections.Generic;
using System.Linq;
using System.IO;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
List<string> files = new List<string>()
{"abc.txt", "def.txt", "GIWFJ1XA.0201000", "GIWFJ1UC.0501000", "NOOBO0XA.0100100", "summary.pdf", "someinfo.zip", "T7F4JUXA.0300600", "vxy98796.csv", "YJHLPLBO.0302300", "YJHLPLUC.0302800"};
Regex r = new Regex("^\\.\\d{7}$");
foreach (string file in files.Where(o => r.IsMatch(Path.GetExtension(o))))
{
Console.WriteLine(file);
}
}
}
Output:
GIWFJ1XA.0201000
GIWFJ1UC.0501000
NOOBO0XA.0100100
T7F4JUXA.0300600
YJHLPLBO.0302300
YJHLPLUC.0302800
Edit: I tried passing (r.IsMatch) directly instead of the lambda with o, but the dotnetfiddle compiler gives me this error:
Compilation error (line 14, col 27): The call is ambiguous between the following methods or properties: 'System.Linq.Enumerable.Where<string>(System.Collections.Generic.IEnumerable<string>, System.Func<string,bool>)' and 'System.Linq.Enumerable.Where<string>(System.Collections.Generic.IEnumerable<string>, System.Func<string,int,bool>)'
I can't debug it right now, so I'd be happy if anyone passing by could suggest a fix. The current code above works, though.
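One possible way around that ambiguity, sketched under assumptions: the error most likely arises because Regex.IsMatch has both a (string) and a (string, int) overload, each of which fits one of the two Where overloads. Wrapping the check in a single, non-overloaded method leaves only one candidate, and it also keeps the match on the extension rather than on the whole file name. The ExtensionFilter class and HasSevenDigitExtension name are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class ExtensionFilter
{
    private static readonly Regex SevenDigits = new Regex(@"^\.\d{7}$");

    // A single, non-overloaded method: it can only convert to
    // Func<string, bool>, so Where() has exactly one candidate.
    public static bool HasSevenDigitExtension(string file)
    {
        return SevenDigits.IsMatch(Path.GetExtension(file));
    }

    public static void Main()
    {
        var files = new List<string> { "abc.txt", "GIWFJ1XA.0201000", "summary.pdf", "YJHLPLUC.0302800" };
        // Method group instead of a lambda
        foreach (string file in files.Where(HasSevenDigitExtension))
        {
            Console.WriteLine(file);
        }
    }
}
```

With the four sample names above, this prints GIWFJ1XA.0201000 and YJHLPLUC.0302800.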

Need to refer to second to the last element of array of partial filenames

I need to find distinct values of partial filenames in an array of filenames. I'd like to do it in one line.
So, I have filenames like these:
string[] filenames = {"aaa_ab12345.txt", "bbb_ab12345.txt", "aaa_ac12345.txt", "bbb_ac12345"};
and I need to find distinct values for ab12345 part of it.
So I currently have something like that:
string[] filenames_partial_distinct = Array.ConvertAll(
filenames,
file => System.IO.Path.GetFileNameWithoutExtension(file)
.Split(new[] {"_","."}, StringSplitOptions.RemoveEmptyEntries)[1]
)
.Distinct()
.ToArray();
Now, I'm getting filenames that are of form of aaa_bbb_ab12345.txt. So, instead of referring to the second part of the filename, I need to refer to the second to the last.
So, how do I refer to an arbitrary element based on length of array in one line, if it's a result of Split method? Something along lines of:
Array.ConvertAll(filenames, file=>file.Split(separator)[this.Length-2]).Distinct().ToArray();
In other words, if a string method results in an array of strings, how do I immediately select element based on the length of array:
String.Split()[third from end, fifth from end, etc.];
If you use GetFileNameWithoutExtension there will be no extension and therefore splitting by '_' will do it. Then you can take the last part with .Last().
string[] filenames_partial_distinct = Array.ConvertAll(
filenames,
file => Path.GetFileNameWithoutExtension(file).Split('_').Last()
)
.Distinct()
.ToArray();
With the input
string[] filenames = { "aaa_ab12345.txt", "bbb_ab12345.txt",
"aaa_ac12345.txt", "bbb_ac12345", "aaa_bbb_ab12345.txt" };
You get the result
{ "ab12345", "ac12345" }
The StringSplitOptions.RemoveEmptyEntries is only required if there are filenames ending with _ (before the extension).
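For the more general "n-th from the end" part of the question, a small helper can hide the length arithmetic so the calling expression stays a one-liner. This is a sketch only; PartialNames and PartFromEnd are hypothetical names, and returning null for names with too few parts is an assumed policy.

```csharp
using System;
using System.IO;
using System.Linq;

public static class PartialNames
{
    // Returns the part that is 'fromEnd' positions from the end of the
    // underscore-separated name (1 = last, 2 = second to last, ...),
    // or null when the name has too few parts.
    public static string PartFromEnd(string file, int fromEnd)
    {
        string[] parts = Path.GetFileNameWithoutExtension(file).Split('_');
        return fromEnd >= 1 && fromEnd <= parts.Length
            ? parts[parts.Length - fromEnd]
            : null;
    }

    public static void Main()
    {
        string[] filenames = { "aaa_ab12345.txt", "bbb_ab12345.txt",
            "aaa_ac12345.txt", "bbb_ac12345", "aaa_bbb_ab12345.txt" };
        string[] distinct = filenames.Select(f => PartFromEnd(f, 1)).Distinct().ToArray();
        Console.WriteLine(string.Join(", ", distinct)); // ab12345, ac12345
    }
}
```

If the project targets C# 8 or later on a runtime that supports System.Index, the index-from-end operator should allow the same thing inline, for example Split('_')[^2] for the second-to-last part, at the cost of an exception when the array is too short.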
Seems you're looking for something like this:
string[] arr = filenames.Select(n => n.Substring(n.LastIndexOf('_') + 1, 7)).Distinct().ToArray();
I usually defer problems like this to regex. They are very powerful. This approach also gives you the opportunity to detect unexpected cases and handle them appropriately.
Here is a crude example, assuming I understood your requirements:
using System;
using System.Linq;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
string MyMatcher(string filename)
{
// this pattern may need work depending on what you need - it says
// extract that pattern between the "()" which is 2 characters and
// 4 digits, exactly; and can be found in `Groups[1]`.
Regex r = new Regex(@".*_(\w{2}\d{4}).*", RegexOptions.IgnoreCase);
Match m = r.Match(filename);
return m.Success
? m.Groups[1].ToString()
: null; // what should happen here?
}
string[] filenames =
{
"aaa_ab12345.txt",
"bbb_ab12345.txt",
"aaa_ac12345.txt",
"bbb_ac12345",
"aaa_bbb_ab12345.txt",
"ae12345.txt" // MyMatcher() return null for this - what should you do if this happens?
};
var results = filenames
.Select(MyMatcher)
.Distinct();
foreach (var result in results)
{
Console.WriteLine(result);
}
}
}
Gives:
ab1234
ac1234
This can be refined further, such as pre-compiled regex patterns, encapsulation in a class, etc.

C# Regex Pattern to remove comma inside double quote delimited string

I can't be the first person to have this issue but hours of searching Stack revealed nothing close to an answer. I have an SSIS script that works over a directory of csv files. This script folds, bends and mutilates these files; performs queries, data cleansing, persists some data and finally outputs a small set to csv file that is ingested by another system.
One of the files has a free text field that contains the value: "20,000 BONUS POINTS". This one field, in a file of 10k rows, one of dozens of similar files, is the problem that I can't seem to solve.
Be advised: I'm weak on both C# and Regex.
Sample csv set:
4121,6383,0,,,TRUE
4122,6384,0,"20,000 BONUS POINTS",,TRUE
4123,6385,,,,
4124,6386,0,,,TRUE
4125,6387,0,,,TRUE
4126,6388,0,,,TRUE
4127,6389,0,,,TRUE
4128,6390,0,,,TRUE
I found plenty of information on how to parse this using a variety of Regex patterns but what I've noticed is the StreamReader.ReadLine() method wraps the complete line with double quotes:
"4121,6383,0,,,TRUE"
such that the output of the regex Replace method:
s = Regex.Replace(line, @"[^\""]([^\""])*[^\""]",
m => m.Value.Replace(",", ""));
looks like this:
412163830TRUE
and the target line that actually contains a double quote delimited string ends up looking like:
"412263840\"20000 BONUS POINTS\"TRUE"
My entire method (for your reading pleasure) is this:
string fileDirectory = "C:\\tmp\\Unzip\\";
string fullPath = "C:\\tmp\\Unzip\\test.csv";
string line = "";
//int count=0;
List<string> list = new List<string>();
try
{
//MessageBox.Show("inside Try Block");
string s = null;
StreamReader infile = new StreamReader(fullPath);
StreamWriter outfile = new StreamWriter(Path.Combine(fileDirectory, "output.csv"));
while ((line = infile.ReadLine()) != null)
{
//line.Substring(0,1).Substring(line.Length-1, 1);
System.Console.WriteLine(line);
Console.WriteLine(line);
line =
s = Regex.Replace(line, @"[^\""]([^\""])*[^\""]",
m => m.Value.Replace(",", ""));
System.Console.WriteLine(s);
list.Add(s);
}
foreach (string item in list)
{
outfile.WriteLine(item);
};
infile.Close();
outfile.Close();
//System.Console.WriteLine("There were {0} lines.", count);
}
catch (Exception e)
{
Console.WriteLine(e.Message);
}
//another addition for TFS consumption
}
Thanks for reading and if you have a useful answer, bless you and your progeny for generations to come!
mfc
EDIT: The requirement is a valid csv file output. In the case of the test data, it would look like this:
4121,6383,0,,,TRUE
4122,6384,0,"20000 BONUS POINTS",,TRUE
4123,6385,,,,
4124,6386,0,,,TRUE
4125,6387,0,,,TRUE
4126,6388,0,,,TRUE
4127,6389,0,,,TRUE
4128,6390,0,,,TRUE
I recommend using a CSV reader lib like others have suggested.
Install-Package LumenWorksCsvReader
https://github.com/phatcher/CsvReader#getting-started
However, if you just want to try something quick and dirty, give this a try.
If I understand correctly, you need to remove commas between double quotes within each line of a CSV file. This should do that.
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
string pattern = @"([""'])(?:(?=(\\?))\2.)*?\1";
List<string> lines = new List<string>();
lines.Add("4121,6383,0,,,TRUE");
lines.Add("4122,6384,0,\"20,000 BONUS POINTS\",,TRUE");
lines.Add("4123,6385,,,,");
lines.Add("4124,6386,0,,,TRUE");
lines.Add("4125,6387,0,,,TRUE");
lines.Add("4126,6388,0,,,TRUE");
lines.Add("4127,6389,0,,,TRUE");
lines.Add("4128,6390,0,,,TRUE");
StringBuilder sb = new StringBuilder();
foreach (var line in lines)
{
sb.Append(Regex.Replace(line, pattern, m => m.Value.Replace(",", ""))+"\n");
}
Console.WriteLine(sb.ToString());
}
}
OUTPUT
4121,6383,0,,,TRUE
4122,6384,0,"20000 BONUS POINTS",,TRUE
4123,6385,,,,
4124,6386,0,,,TRUE
4125,6387,0,,,TRUE
4126,6388,0,,,TRUE
4127,6389,0,,,TRUE
4128,6390,0,,,TRUE
https://dotnetfiddle.net/flmWG3
I haven't tried with numerous lines, but this would be my first approach:
namespace ConsoleTestApplication
{
class Program
{
static void Main(string[] args)
{
var before = "4122,6384,0,\"20,000 BONUS POINTS\",,TRUE";
var pattern = @"""[^""]*""";
var after = Regex.Replace(before, pattern, match => match.Value.Replace(",", ""));
Console.WriteLine(after);
}
}
}
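A regex-free sketch of the same idea, for comparison: walk the line once, track whether the current position is inside double quotes, and drop only the commas seen while inside. It assumes that a quoted field never spans more than one line; the QuotedCommaRemover class and its method name are hypothetical.

```csharp
using System;
using System.Text;

public static class QuotedCommaRemover
{
    // Drops every comma that appears between a pair of double quotes.
    // A doubled quote ("") inside a field toggles the flag twice, so it
    // leaves the inside/outside state unchanged.
    public static string RemoveCommasInsideQuotes(string line)
    {
        var result = new StringBuilder(line.Length);
        bool insideQuotes = false;
        foreach (char c in line)
        {
            if (c == '"')
            {
                insideQuotes = !insideQuotes;
            }
            if (c == ',' && insideQuotes)
            {
                continue; // skip commas that are part of quoted data
            }
            result.Append(c);
        }
        return result.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(RemoveCommasInsideQuotes("4122,6384,0,\"20,000 BONUS POINTS\",,TRUE"));
        // 4122,6384,0,"20000 BONUS POINTS",,TRUE
    }
}
```

Lines without quotes pass through unchanged, so the method could be applied to every line read by StreamReader.ReadLine().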

How Do I get Files From C# Directory

I need to get all files with prefix 009 from a server path.
But my code retrieves all files with a 0000 prefix, not specifically those that start with 009.
For example, I have files "000028447_ ghf.doc","0000316647 abcf.doc","009028447_ test2.doc","abcd.doc".
string [] files =Directory.GetFiles(filePath,"009*.doc)
returns all files except "abcd.doc", but I need only "009028447_ test2.doc".
If I call Directory.GetFiles(filePath,"ab*.doc) it retrieves "abcd.doc" and works fine. But when I give a pattern like "009" or "00002", it doesn't work as expected.
Your code snippet is missing a closing quote-character in the pattern. The code should be:
string[] files = Directory.GetFiles(filePath, "009*.doc");
Other than that, it seems to be working as intended. I've tested this by creating a folder with the files you mention in the question.
Next I created a console application, which uses your code to find the files, and prints all the results to the console. The output is the expected result:
C:\testfolder\009028447_ test2.doc
Here is the entire code for the console application:
using System;
using System.IO;
class Program
{
static void Main(string[] args)
{
string filePath = @"C:\testfolder";
string[] files = Directory.GetFiles(filePath, "009*.doc");
// Creates a string with all the elements of the array, separated by ", "
string matchingFiles = string.Join(", ", files);
Console.WriteLine(matchingFiles);
// Since there is only one matching file, the above line only prints:
// C:\testfolder\009028447_ test2.doc
}
}
In conclusion, the code works. If you are getting other results, there must be other differences in your setup or code that you haven't mentioned.
If it is true (and I did not check) that you are receiving the wrong files, you could use a foreach loop or LINQ to check whether each file matches your criteria:
foreach:
List<string> arrPaths = new List<string>();
foreach (string strPath in Directory.GetFiles(filePath, "*.doc"))
{
    // GetFiles returns full paths, so test the file name only
    string name = Path.GetFileName(strPath);
    if (name.StartsWith("009") && name.EndsWith(".doc"))
        arrPaths.Add(strPath);
}
LINQ:
List<string> arrPaths = Directory.GetFiles(filePath, "*.doc").Where(pth => Path.GetFileName(pth).StartsWith("009") && pth.EndsWith(".doc")).ToList();
Both ways are more of a workaround than a real solution, but I hope they help :)
EDIT
If you only want the file names, I would strip the filePath from each strPath like this:
foreach:
arrPaths.Add(strPath.Replace(filePath + "\\", ""));
LINQ:
List<string> arrPaths = Directory.GetFiles(filePath, "*.doc").Where(pth => Path.GetFileName(pth).StartsWith("009") && pth.EndsWith(".doc")).Select(pth => pth.Replace(filePath + "\\", "")).ToList();

Convert a tab delimited file into CSV file in c# [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to covert tab separated file to CSV file
I have a tab-delimited text file which I have to convert into a CSV file, and all of this must be done in C# code. My txt file is very large (about 1.5 GB), so I want the conversion to be quick. Please help me.
If your input tab-delimited text file does not have any commas as part of the data, then it is a very straightforward find and replace, similar to the other answers here:
var lines = File.ReadAllLines(path);
var csv= lines.Select(row => string.Join(",", row.Split('\t')));
File.WriteAllLines(path, csv);
But if your data has commas, doing this is going to break your columns, as you now have extra commas that are not supposed to be delimiters but will be interpreted as such. How to handle it depends greatly on which application you will be using to read the CSV.
A Microsoft Excel compatible CSV is going to have double quotes around fields with commas to make sure they are interpreted as data and not a delimiter. This also means that fields that contain double quotes as data will need special treatment.
I would recommend a similar approach with an extension method.
var input = File.ReadAllLines(path);
var lines = input.Select(row => row.Split('\t'));
lines = lines.Select(row => row.Select(field => field.EscapeCsvField(',', '"')).ToArray());
var csv = lines.Select(row => string.Join(",", row));
File.WriteAllLines(path, csv.ToArray());
And here's the EscapeCsvField extension method:
static class Extension
{
public static String EscapeCsvField(this String source, Char delimiter, Char escapeChar)
{
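// Quote fields that contain the delimiter or the quote character;
// embedded quote characters are doubled, as Excel-compatible CSV expects.
if (source.Contains(delimiter) || source.Contains(escapeChar))
    return String.Format("{0}{1}{0}", escapeChar, source.Replace(escapeChar.ToString(), new string(escapeChar, 2)));
return source;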
}
}
Also, if the file is large, it might be best to not read the entire file into memory. In that case, I would suggest writing the CSV output to a different file and then you could use StreamReader and StreamWriter to only work with it 1 line at a time.
var tabPath = path;
var csvPath = Path.Combine(
Path.GetDirectoryName(path),
String.Format("{0}.{1}", Path.GetFileNameWithoutExtension(path), "csv"));
using (var sr = new StreamReader(tabPath))
using (var sw = new StreamWriter(csvPath, false))
{
while (!sr.EndOfStream)
{
var line = sr.ReadLine().Split('\t').Select(field => field.EscapeCsvField(',', '"')).ToArray();
var csv = String.Join(",", line);
sw.WriteLine(csv);
}
}
File.Delete(tabPath);
var csv = File.ReadAllLines("Path").Select(line => line.Replace("\t", ","));
You could simply call
public void ConvertToCSV(string strPath, string strOutput)
{
    // Read from the strPath argument rather than a hard-coded "Path" literal
    File.WriteAllLines(strOutput, File.ReadAllLines(strPath).Select(line => line.Replace("\t", ",")));
}
There is a lot of content already on SO about handling .CSV files; please search first or try something.
If the format of your file is strict, you could use string.Split and string.Join:
var lines = File.ReadAllLines(path);
var newLines = lines.Select(l => string.Join(",", l.Split('\t')));
File.WriteAllLines(path, newLines);
