How to read .txt and count word/length, etc

I took an exam last week and there was one really hard task that I couldn't work out.
I was given a .txt file containing some text.
The text looks like this:
Der zerbrochne Krug, ein Lustspiel,
von Heinrich von Kleist.
Berlin. In der Realschulbuchhandlung.
1811.
[8]
PERSONEN.
WALTER, Gerichtsrath. ADAM, Dorfrichter. LICHT, Schreiber. FRAU MARTHE
RULL. EVE, ihre Tochter. VEIT TÜMPEL, ein Bauer. RUPRECHT, sein Sohn.
FRAU BRIGITTE. EIN BEDIENTER, BÜTTEL, MÄGDE, etc.
Die Handlung spielt in einem niederländischen Dorfe bei Utrecht.
[9] Scene: Die Gerichtsstube. Erster Auftritt.
I was also given a Main method containing this code:
var document = new Document("Text.txt");
if (document.Contains("Haus") == true)
Console.WriteLine(document["Haus"]); // Word: haus, Frequency.: 36, Length: 4
else
Console.WriteLine("Word not found!");
I had to write a class that makes the code above work.
Does anyone have an idea how to solve this, and could you help a young business informatics student understand how it works?
Normally I find StreamReader easy to use, but I couldn't manage it in this case...
Thank you very much, and best wishes and good health to everyone who tries to help me.

Here is the class you are looking for; I hope it helps.
class Document : Dictionary<string, int>
{
private const char WORDSPLITTER = ' ';
public string Filename { get; }
public Document(string filename)
{
Filename = filename;
Fill();
}
private void Fill()
{
foreach (var item in File.ReadLines(Filename))
{
foreach (var word in item.Split(WORDSPLITTER))
{
if (ContainsKey(word))
base[word] += 1;
else
Add(word, 1);
}
}
}
public bool Contains(string word) => ContainsKey(word);
public new string this[string word]
{
get
{
if (ContainsKey(word))
return $"Word: {word}, frequency: {base[word]}, Length: {word.Length}";
else
return $"Word {word} not found!";
}
}
}
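The expected output in the question prints "haus" for a lookup of "Haus", which hints at case-insensitive matching, and splitting on a single space leaves punctuation attached to words ("Krug,"). The following is a minimal sketch of a counting step that handles both, assuming that is the desired behaviour. The name WordCounter and the separator set are illustrative assumptions, not part of the exam task.

```csharp
using System;
using System.Collections.Generic;

public static class WordCounter
{
    // Separator set is an assumption of this sketch; extend it to suit the text.
    private static readonly char[] Separators =
        { ' ', '\t', '.', ',', ';', ':', '!', '?', '(', ')', '[', ']' };

    // Counts words across the given lines, ignoring case.
    public static Dictionary<string, int> Count(IEnumerable<string> lines)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (var line in lines)
        {
            foreach (var word in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            {
                // current stays 0 when the word has not been seen yet
                counts.TryGetValue(word, out int current);
                counts[word] = current + 1;
            }
        }
        return counts;
    }
}
```

It could be fed from a file with WordCounter.Count(File.ReadLines("Text.txt")), and a Document class could wrap the returned dictionary instead of inheriting from Dictionary.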

Try the function below:
private bool FindWord( string SearchWord)
{
List<string> LstWords = new List<string>();
string[] Lines = File.ReadAllLines("Path of your File");
foreach (string line in Lines )
{
string[] words = line.Split(' ');
foreach (string word in words )
{
LstWords.Add(word);
}
}
// Compare case-insensitively by upper-casing both the stored word and the search word
int index = LstWords.FindIndex(x => x.Trim ().ToUpper ().Equals(SearchWord.ToUpper ()));
if (index==-1)
{
// Not Found
return false;
}
else
{
//word found
return true;
}
}

I find that Regex could be a good way to solve this:
var ms = Regex.Matches(textToSearch, wordToFind, RegexOptions.IgnoreCase);
if (ms.Count > 0)
{
Console.WriteLine($"Word: {wordToFind} Frequency: {ms.Count} Length: {wordToFind.Length}");
}
else
{
Console.WriteLine("Word not found!");
}
Regex is in the namespace:
using System.Text.RegularExpressions;
You will need to set the RegexOptions that are appropriate for your problem.
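One caveat with passing the word directly as the pattern: it also matches inside longer words ("Haus" inside "Rathaus"), and characters such as '.' or '+' in the word are treated as regex syntax. A small sketch of a whole-word count, assuming whole-word matching is what is wanted; the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeWordSearch
{
    // Counts whole-word, case-insensitive occurrences of word in text.
    public static int CountWord(string text, string word)
    {
        // Regex.Escape keeps characters such as '.' or '+' literal;
        // \b anchors the match at word boundaries.
        string pattern = @"\b" + Regex.Escape(word) + @"\b";
        return Regex.Matches(text, pattern, RegexOptions.IgnoreCase).Count;
    }
}
```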

One approach would be the following steps.
Create a class Document with the properties below:
//Contains file name
public string FileName { get; set; }
//Contains file data
public string FileData { get; set; }
//Contains word count
public int WordCount { get; set; }
//Holds all the words
public Dictionary<string, int> DictWords { get; set; } = new Dictionary<string, int>();
Define the constructor, which does the following:
Assign the incoming file name to the FileName property
Read the file from the path and get all the words from the file
Count the words and insert them into the dictionary, so the final dictionary will
hold all the <word, count> records
//Constructor
public Document(string fileName)
{
//1. Assign the FileName property
FileName = fileName;
//2. Read File from the Path
string text = System.IO.File.ReadAllText(fileName, Encoding.Default);
string[] source = text.Split(new char[] { '.', '!', '?', ',', '(', ')', '\t', '\n', '\r', ' ' },
StringSplitOptions.RemoveEmptyEntries);
//3. Add the counts to Dictionary
foreach (String word in source)
{
if (DictWords.ContainsKey(word))
{
DictWords[word]++;
} else
{
DictWords[word] = 1;
}
}
}
Create a "Contains" method, which is used to check whether the word is present in
the document:
//4. Returns true or false depending on whether the key/word exists.
public bool Contains(string word)
{
if (DictWords.ContainsKey(word))
{
return true;
}
else
{
return false;
}
}
Create a string indexer on the class to produce the output that is printed to the
Console:
//5. Define an indexer on the word.
public string this[string word]
{
get
{
if (DictWords.TryGetValue(word, out int value))
{
return $"Word: {word}, Frequency.:{value}, Length: {word.Length}";
}
return string.Empty;
}
}
Tests:
var document = new Document(@"Text.txt");
if (document.Contains("BEDIENTER") == true)
Console.WriteLine(document["BEDIENTER"]);
else
Console.WriteLine("Word not found!");
//Output
// Word: BEDIENTER, Frequency.:1, Length: 9

Related

C#: read text file separated by additional newline character

I have some sql commands that are separated by an additional newline character:
ALTER TABLE XXX
ALTER COLUMN xxx real
ALTER TABLE YYY
ALTER COLUMN yyy real
ALTER TABLE ZZZ
ALTER COLUMN zzz real
I've tried reading the file by using an array of character separators such as the following,
new char[] { '\n', '\r'}
inside this method:
private static List<string> ReadFile(string FileName, char[] seps)
{
if (!File.Exists(FileName))
{
Console.WriteLine("File not found");
return null;
}
using (StreamReader sr = new StreamReader(FileName, Encoding.Default))
{
string content = sr.ReadToEnd();
return content.Split(seps, StringSplitOptions.RemoveEmptyEntries).ToList();
}
}
However, this doesn't seem to be working. I would like to have each command represented by a separate string. How can I do this?

Why not use File.ReadAllLines()?
private static List<string> ReadFile(string FileName)
{
if (!File.Exists(FileName))
{
Console.WriteLine("File not found");
return null;
}
var lines = File.ReadAllLines(FileName);
return lines.ToList();
}
This will automatically read and split your file by newlines.
If you want to filter out empty lines, do this:
var nonEmpty = ReadFile(path).Where(x => !string.IsNullOrEmpty(x)).ToList();
Side note, I would change your if statement to throw an exception if the file cannot be found.
if (!File.Exists(FileName))
{
throw new FileNotFoundException("Can't find file");
}

You can filter the results. When I read them in, the empty lines had a length of 1 and the character's value was 131 for some reason, so I just filtered by length > 1:
void Main()
{
var results = ReadFile(@"C:\temp\sql.txt", new char[]{'\n'});
Console.WriteLine(results.Count);
foreach (var result in results)
{
Console.WriteLine(result);
}
}
private static List<string> ReadFile(string FileName, char[] seps)
{
if (!File.Exists(FileName))
{
Console.WriteLine("File not found");
return null;
}
using (StreamReader sr = new StreamReader(FileName, Encoding.Default))
{
string content = sr.ReadToEnd();
return content.Split(seps, StringSplitOptions.RemoveEmptyEntries).Where (c => c.Length > 1).ToList();
}
}

Try This:
private static List<string> ReadFile(string FileName)
{
List<string> commands = new List<string>();
StringBuilder command = new StringBuilder();
if (!File.Exists(FileName))
{
Console.WriteLine("File not found");
return null;
}
foreach (var line in File.ReadLines(FileName))
{
if (!String.IsNullOrEmpty(line))
{
command.Append(line + "\n");
}
else
{
commands.Add(command.ToString());
command.Clear();
}
}
commands.Add(command.ToString());
return commands;
}

If you are sure you'll always have \r\n line endings, you can use:
var commands = content.Split(new []{"\r\n\r\n"}, StringSplitOptions.RemoveEmptyEntries);
Otherwise, try using regex:
var commands = Regex.Split(content, @"\r?\n\r?\n");
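To make the regex route concrete, here is a self-contained sketch that splits a string into blocks separated by one or more blank lines and accepts both \r\n and \n endings. The class and method names are illustrative assumptions, and "blank" here means truly empty lines; a line containing only spaces would not act as a separator.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CommandSplitter
{
    // Splits content into blocks separated by one or more empty lines.
    public static string[] SplitCommands(string content)
    {
        return Regex.Split(content, @"(?:\r?\n){2,}")
            .Select(block => block.Trim())        // drop stray leading/trailing newlines
            .Where(block => block.Length > 0)     // ignore empty blocks
            .ToArray();
    }
}
```

Each returned string still contains its inner line break, so a two-line ALTER TABLE / ALTER COLUMN command stays together as one entry.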

Thank you everyone for your answers. I ended up going with this helper method:
private static List<string> GetCommands(string location)
{
List<string> ret = new List<string>();
List<string> tmp = ReadFile(location, new string[] { "\r\n\r\n"});
for (int i = 0; i < tmp.Count; i++)
{
string rem = tmp[i].Replace("\r", "");
ret.Add(rem);
}
return ret;
}
As an aside, the equivalent is so much easier in Python. For example, what I'm trying to do can be expressed in these three lines:
with open('commands.txt', 'r') as f:
    content = f.read()
commands = [ command for command in content.split('\n\n') ]

Extracting specific part of a text file in C#

I usually add some strings from a text file into a list or array line by line, but I am now using "#" signs as separators in the text file. How can I read the two strings "softpedia.com" and "download.com" into a list, using the two "#" lines as break points? Bear in mind that there might be more or fewer strings in between the two hashes.
e.g.
# Internal Hostnames
softpedia.com
download.com
# External Hostnames
Expected output:
softpedia.com
download.com

class Program
{
static void Main()
{
using (var reader = File.OpenText("test.txt"))
{
foreach (var line in Parse(reader))
{
Console.WriteLine(line);
}
}
}
public static IEnumerable<string> Parse(StreamReader reader)
{
string line;
bool first = false;
while ((line = reader.ReadLine()) != null)
{
if (!line.StartsWith("#"))
{
if (first)
{
yield return line;
}
}
else if (!first)
{
first = true;
}
else
{
yield break;
}
}
}
}
And if you just want them in a list:
using (var reader = File.OpenText("test.txt"))
{
List<string> hostnames = Parse(reader).ToList();
}

Read it into a buffer and let regex do the work.
string input = @"
# Internal Hostnames
softpedia.com
download.com
# External Hostnames
";
string pattern = @"^(?!#)(?<Text>[^\r\s]+)(?:\s?)";
Regex.Matches(input, pattern, RegexOptions.Multiline)
.OfType<Match>()
.Select (mt => mt.Groups["Text"].Value)
.ToList()
.ForEach( site => Console.WriteLine (site));
/* Outputs
softpedia.com
download.com
*/

It sounds like you want to read all of the lines in between a set of # start lines. If so try the following
List<string> ReadLines(string filePath) {
var list = new List<string>();
var foundStart = false;
foreach (var line in File.ReadAllLines(filePath)) {
if (line.Length > 0 && line[0] == '#') {
if (foundStart) {
return list;
}
foundStart = true;
} else if (foundStart) {
list.Add(line);
}
}
return list;
}
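The same "between the first two # lines" logic can also be written with LINQ. This is only a sketch under the assumption that header lines start with '#' in the first column; the names SectionReader and FirstSection are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SectionReader
{
    // Returns the non-blank lines between the first '#' header line and the next one.
    public static List<string> FirstSection(IEnumerable<string> lines)
    {
        return lines
            .SkipWhile(l => !l.StartsWith("#", StringComparison.Ordinal)) // skip text before the first header
            .Skip(1)                                                      // skip the header itself
            .TakeWhile(l => !l.StartsWith("#", StringComparison.Ordinal)) // stop at the next header
            .Where(l => l.Trim().Length > 0)                              // ignore blank lines
            .ToList();
    }
}
```

With a file, it could be called as SectionReader.FirstSection(File.ReadLines(filePath)), which stops reading once the second header is reached.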

Searching Specific Data From a File

I have a file containing text and a few numbers. I just want to extract the numbers from it. How do I go about it?
I tried using Split but have had no luck so far.
My file looks like this:
AT+CMGL="ALL"
+CMGL: 5566,"REC READ","Ufone"
Dear customer, your DAY_BUCKET subscription will expire on 02/05/09
+CMGL: 5565,"REC READ","+923466666666"
Kindly tell me how to extract numbers like +923466666666 from this file so I can put them into another file or a textbox.
Thanks

Here's an example using the String.Split. The "number" contains a '+', so really it should be treated as a string not a number. I'm presuming it's a telephone number with the '+' potentially used for international calls? If it is a telephone number, you need to be careful of dashes, spaces in the number as well as extension numbers added to the end eg "+9234 666-66666 ext 235" and so on...
Anyway - hopefully the example is useful in getting to grips with Split.
The code includes unit tests using NUnit v2.4.8:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using NUnit.Framework;
using System.Text.RegularExpressions;
namespace SO.NumberExtractor.Test
{
public class NumberExtracter
{
public List<string> ExtractNumbers(string lines)
{
List<string> numbers = new List<string>();
string[] seperator = { System.Environment.NewLine };
string[] seperatedLines = lines.Split(seperator, StringSplitOptions.RemoveEmptyEntries);
foreach (string line in seperatedLines)
{
string s = ExtractNumber(line);
numbers.Add(s);
}
return numbers;
}
public string ExtractNumber(string line)
{
string s = line.Split(',').Last<string>().Trim('"');
return s;
}
public string ExtractNumberWithoutLinq(string line)
{
string[] fields = line.Split(',');
string s = fields[fields.Length - 1];
s = s.Trim('"');
return s;
}
}
[TestFixture]
public class NumberExtracterTest
{
private readonly string LINE1 = "AT+CMGL=\"ALL\" +CMGL: 5566,\"REC READ\",\"Ufone\" Dear customer, your DAY_BUCKET subscription will expire on 02/05/09 +CMGL: 5565,\"REC READ\",\"+923466666666\"";
private readonly string LINE2 = "AT+CMGL=\"ALL\" +CMGL: 5566,\"REC READ\",\"Ufone\" Dear customer, your DAY_BUCKET subscription will expire on 02/05/09 +CMGL: 5565,\"REC READ\",\"+923466666667\"";
private readonly string LINE3 = "AT+CMGL=\"ALL\" +CMGL: 5566,\"REC READ\",\"Ufone\" Dear customer, your DAY_BUCKET subscription will expire on 02/05/09 +CMGL: 5565,\"REC READ\",\"+923466666668\"";
[Test]
public void ExtractOneLineWithoutLinq()
{
string expected = "+923466666666";
NumberExtracter c = new NumberExtracter();
string result = c.ExtractNumberWithoutLinq(LINE1);
Assert.AreEqual(expected, result);
}
[Test]
public void ExtractOneLineUsingLinq()
{
string expected = "+923466666666";
NumberExtracter c = new NumberExtracter();
string result = c.ExtractNumber(LINE1);
Assert.AreEqual(expected, result);
}
[Test]
public void ExtractMultipleLines()
{
StringBuilder sb = new StringBuilder();
sb.AppendLine(LINE1);
sb.AppendLine(LINE2);
sb.AppendLine(LINE3);
NumberExtracter ne = new NumberExtracter();
List<string> extractedNumbers = ne.ExtractNumbers(sb.ToString());
string expectedFirst = "+923466666666";
string expectedSecond = "+923466666667";
string expectedThird = "+923466666668";
Assert.AreEqual(expectedFirst, extractedNumbers[0]);
Assert.AreEqual(expectedSecond, extractedNumbers[1]);
Assert.AreEqual(expectedThird, extractedNumbers[2]);
}
}
}

If the numbers are all at the end of the lines then you can use code like the following
foreach ( string line in File.ReadAllLines(@"c:\path\to\file.txt") ) {
Match result = Regex.Match(line, @"\+(\d+)""$");
if ( result.Success ) {
var number = result.Groups[1].Value;
// do what you want with the number
}
}
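Note that the capture group above holds only the digits. If the leading '+' should be kept, as in the number shown in the question, the group can be widened. A self-contained sketch, assuming the numbers always appear as the last quoted field on a line; PhoneNumberExtractor is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PhoneNumberExtractor
{
    // Matches a quoted "+digits" value at the end of a line, e.g. ...,"+923466666666"
    private static readonly Regex QuotedNumber = new Regex(@"""(\+\d+)""\s*$");

    public static List<string> Extract(IEnumerable<string> lines)
    {
        var numbers = new List<string>();
        foreach (var line in lines)
        {
            Match m = QuotedNumber.Match(line);
            if (m.Success)
                numbers.Add(m.Groups[1].Value); // group 1 includes the leading '+'
        }
        return numbers;
    }
}
```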

How large is the file? If the file is under a few megabytes in size I would recommend loading the file contents into a string and using a compiled regular expression to extract matches.
Here's a quick example:
Regex NumberExtractor = new Regex("[0-9]{7,16}",RegexOptions.Compiled);
/// <summary>
/// Extracts numbers between seven and sixteen digits long from the target file.
/// Example number to be extracted: +923466666666
/// </summary>
/// <param name="TargetFilePath"></param>
/// <returns>List of the matching numbers</returns>
private IEnumerable<ulong> ExtractLongNumbersFromFile(string TargetFilePath)
{
if (String.IsNullOrEmpty(TargetFilePath))
throw new ArgumentException("TargetFilePath is null or empty.", "TargetFilePath");
if (File.Exists(TargetFilePath) == false)
throw new Exception("Target file does not exist!");
FileStream TargetFileStream = null;
StreamReader TargetFileStreamReader = null;
string FileContents = "";
List<ulong> ReturnList = new List<ulong>();
try
{
TargetFileStream = new FileStream(TargetFilePath, FileMode.Open);
TargetFileStreamReader = new StreamReader(TargetFileStream);
FileContents = TargetFileStreamReader.ReadToEnd();
MatchCollection Matches = NumberExtractor.Matches(FileContents);
foreach (Match CurrentMatch in Matches) {
ReturnList.Add(System.Convert.ToUInt64(CurrentMatch.Value));
}
}
catch (Exception ex)
{
//Your logging, etc...
}
finally
{
if (TargetFileStream != null) {
TargetFileStream.Close();
TargetFileStream.Dispose();
}
if (TargetFileStreamReader != null)
{
TargetFileStreamReader.Dispose();
}
}
return (IEnumerable<ulong>)ReturnList;
}
Sample Usage:
List<ulong> Numbers = (List<ulong>)ExtractLongNumbersFromFile(@"v:\TestExtract.txt");

Reading a line from a streamreader without consuming?

Is there a way to read ahead one line to test whether the next line contains specific tag data?
I'm dealing with a format that has a start tag but no end tag.
I would like to read a line, add it to a structure, and then test the line below to make sure it is not a new "node". If it isn't, I keep adding; if it is, I close off that struct and make a new one.
The only solution I can think of is to have two stream readers going at the same time, shuffling their way along in lockstep, but that seems wasteful (if it will even work).
I need something like Peek, but PeekLine.

The problem is the underlying stream may not even be seekable. If you take a look at the stream reader implementation it uses a buffer so it can implement TextReader.Peek() even if the stream is not seekable.
You could write a simple adapter that reads the next line and buffers it internally, something like this:
public class PeekableStreamReaderAdapter
{
private StreamReader Underlying;
private Queue<string> BufferedLines;
public PeekableStreamReaderAdapter(StreamReader underlying)
{
Underlying = underlying;
BufferedLines = new Queue<string>();
}
public string PeekLine()
{
string line = Underlying.ReadLine();
if (line == null)
return null;
BufferedLines.Enqueue(line);
return line;
}
public string ReadLine()
{
if (BufferedLines.Count > 0)
return BufferedLines.Dequeue();
return Underlying.ReadLine();
}
}
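In the adapter above, every call to PeekLine reads one more line from the underlying reader, so calling it twice looks two lines ahead. If the goal is a peek that can be repeated without advancing, a one-line buffer is enough. The following is a sketch under that assumption; PeekableLineReader is a hypothetical name, and it wraps TextReader so it also works with StringReader.

```csharp
using System;
using System.IO;

// Hypothetical helper: wraps any TextReader and buffers at most one line.
public sealed class PeekableLineReader
{
    private readonly TextReader _reader;
    private string _peeked;
    private bool _hasPeeked;

    public PeekableLineReader(TextReader reader)
    {
        _reader = reader ?? throw new ArgumentNullException(nameof(reader));
    }

    // Returns the next line without consuming it; repeated calls return the same line.
    public string PeekLine()
    {
        if (!_hasPeeked)
        {
            _peeked = _reader.ReadLine();
            _hasPeeked = true;
        }
        return _peeked;
    }

    // Returns the next line and consumes it (null at end of input).
    public string ReadLine()
    {
        if (_hasPeeked)
        {
            _hasPeeked = false;
            return _peeked;
        }
        return _reader.ReadLine();
    }
}
```

A parsing loop can then call PeekLine to check whether the next line starts a new node and only call ReadLine when it belongs to the current one.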

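You could store the position by reading StreamReader.BaseStream.Position, then read the next line, do your test, and seek back to the position from before you read the line. Be aware that StreamReader buffers its input, so BaseStream.Position can be ahead of the characters you have actually consumed; this only works reliably if you track the logical position yourself:
// Peek at the next line
long peekPos = reader.BaseStream.Position;
string line = reader.ReadLine();
if (line.StartsWith("<tag start>"))
{
// This is a new tag, so we reset the position
reader.BaseStream.Seek(peekPos, SeekOrigin.Begin);
// Drop the reader's internal buffer so it re-reads from the new position
reader.DiscardBufferedData();
}
else
{
// This is part of the same node.
}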
This is a lot of seeking and re-reading the same lines. Using some logic, you may be able to avoid this altogether - for instance, when you see a new tag start, close out the existing structure and start a new one - here's a basic algorithm:
SomeStructure myStructure = null;
while (!reader.EndOfStream)
{
string currentLine = reader.ReadLine();
if (currentLine.StartsWith("<tag start>"))
{
// Close out existing structure.
if (myStructure != null)
{
// Close out the existing structure.
}
// Create a new structure and add this line.
myStructure = new SomeStructure();
// Append to myStructure.
}
else
{
// Add to the existing structure.
if (myStructure != null)
{
// Append to existing myStructure
}
else
{
// This means the first line was not part of a structure.
// Either handle this case, or throw an exception.
}
}
}

Why the difficulty? Return the next line, regardless. Check if it is a new node, if not, add it to the struct. If it is, create a new struct.
// A C# sketch; Node, IsNode and reader are placeholders for your own type, test and TextReader
var nodes = new List<Node>();
Node current = null;
string line;
while ((line = reader.ReadLine()) != null) {
if (IsNode(line)) {
if (current != null) nodes.Add(current);
current = new Node();
continue;
}
// Whatever processing you need to do
if (current != null) current.AddLine(line);
}
if (current != null) nodes.Add(current); // Add the last one to the collection
// Use your structures here
foreach (var node in nodes) {
}

Here is what I've got so far. I went with the split route rather than reading line by line with the StreamReader.
I'm sure there are a few places that are dying to be more elegant, but for now it seems to be working.
Please let me know what you think.
struct INDI
{
public string ID;
public string Name;
public string Sex;
public string BirthDay;
public bool Dead;
}
struct FAM
{
public string FamID;
public string type;
public string IndiID;
}
List<INDI> Individuals = new List<INDI>();
List<FAM> Family = new List<FAM>();
private void button1_Click(object sender, EventArgs e)
{
string path = @"C:\mostrecent.ged";
ParseGedcom(path);
}
private void ParseGedcom(string path)
{
//Open path to GED file
StreamReader SR = new StreamReader(path);
//Read the entire file and then split on "0 @" for individuals and families (no other info is needed for this instance)
string[] Holder = SR.ReadToEnd().Replace("0 @", "\u0646").Split('\u0646');
//For each new cell in the holder array look for Individuals and familys
foreach (string Node in Holder)
{
//Sub Split the string on the returns to get a true block of info
string[] SubNode = Node.Replace("\r\n", "\r").Split('\r');
//If a individual is found
if (SubNode[0].Contains("INDI"))
{
//Create new Structure
INDI I = new INDI();
//Add the ID number and remove extra formatting
I.ID = SubNode[0].Replace("@", "").Replace(" INDI", "").Trim();
//Find the name remove extra formating for last name
I.Name = SubNode[FindIndexinArray(SubNode, "NAME")].Replace("1 NAME", "").Replace("/", "").Trim();
//Find Sex and remove extra formating
I.Sex = SubNode[FindIndexinArray(SubNode, "SEX")].Replace("1 SEX ", "").Trim();
//Determine if there is a birthday; -1 means no
if (FindIndexinArray(SubNode, "1 BIRT ") != -1)
{
// add birthday to Struct
I.BirthDay = SubNode[FindIndexinArray(SubNode, "1 BIRT ") + 1].Replace("2 DATE ", "").Trim();
}
// Determine if there is a death tag; returns -1 if not found
if (FindIndexinArray(SubNode, "1 DEAT ") != -1)
{
//convert Y or N to true or false ( defaults to False so no need to change unless Y is found.
if (SubNode[FindIndexinArray(SubNode, "1 DEAT ")].Replace("1 DEAT ", "").Trim() == "Y")
{
//set death
I.Dead = true;
}
}
//add the Struct to the list for later use
Individuals.Add(I);
}
// Start Family section
else if (SubNode[0].Contains("FAM"))
{
//grab Fam id from node early on to keep from doing it over and over
string FamID = SubNode[0].Replace("@ FAM", "");
// Multiple children can exist for each family so this section had to be a bit more dynamic
// Look at each line of node
foreach (string Line in SubNode)
{
// If node is HUSB
if (Line.Contains("1 HUSB "))
{
FAM F = new FAM();
F.FamID = FamID;
F.type = "PAR";
F.IndiID = Line.Replace("1 HUSB ", "").Replace("@","").Trim();
Family.Add(F);
}
//If node for Wife
else if (Line.Contains("1 WIFE "))
{
FAM F = new FAM();
F.FamID = FamID;
F.type = "PAR";
F.IndiID = Line.Replace("1 WIFE ", "").Replace("@", "").Trim();
Family.Add(F);
}
//if node for multi children
else if (Line.Contains("1 CHIL "))
{
FAM F = new FAM();
F.FamID = FamID;
F.type = "CHIL";
F.IndiID = Line.Replace("1 CHIL ", "").Replace("@", "");
Family.Add(F);
}
}
}
}
}
private int FindIndexinArray(string[] Arr, string search)
{
int Val = -1;
for (int i = 0; i < Arr.Length; i++)
{
if (Arr[i].Contains(search))
{
Val = i;
}
}
return Val;
}

C# Sanitize File Name

I recently have been moving a bunch of MP3s from various locations into a repository. I had been constructing the new file names using the ID3 tags (thanks, TagLib-Sharp!), and I noticed that I was getting a System.NotSupportedException:
"The given path's format is not supported."
This was generated by either File.Copy() or Directory.CreateDirectory().
It didn't take long to realize that my file names needed to be sanitized. So I did the obvious thing:
public static string SanitizePath_(string path, char replaceChar)
{
string dir = Path.GetDirectoryName(path);
foreach (char c in Path.GetInvalidPathChars())
dir = dir.Replace(c, replaceChar);
string name = Path.GetFileName(path);
foreach (char c in Path.GetInvalidFileNameChars())
name = name.Replace(c, replaceChar);
return dir + name;
}
To my surprise, I continued to get exceptions. It turned out that ':' is not in the set of Path.GetInvalidPathChars(), because it is valid in a path root. I suppose that makes sense, but this has to be a pretty common problem. Does anyone have some short code that sanitizes a path? The most thorough version I've come up with is this, but it feels like it is probably overkill.
// replaces invalid characters with replaceChar
public static string SanitizePath(string path, char replaceChar)
{
// construct a list of characters that can't show up in filenames.
// need to do this because ":" is not in InvalidPathChars
if (_BadChars == null)
{
_BadChars = new List<char>(Path.GetInvalidFileNameChars());
_BadChars.AddRange(Path.GetInvalidPathChars());
_BadChars = Utility.GetUnique<char>(_BadChars);
}
// remove root
string root = Path.GetPathRoot(path);
path = path.Remove(0, root.Length);
// split on the directory separator character. Need to do this
// because the separator is not valid in a filename.
List<string> parts = new List<string>(path.Split(new char[]{Path.DirectorySeparatorChar}));
// check each part to make sure it is valid.
for (int i = 0; i < parts.Count; i++)
{
string part = parts[i];
foreach (char c in _BadChars)
{
part = part.Replace(c, replaceChar);
}
parts[i] = part;
}
return root + Utility.Join(parts, Path.DirectorySeparatorChar.ToString());
}
Any improvements to make this function faster and less baroque would be much appreciated.

To clean up a file name you could do this
private static string MakeValidFileName( string name )
{
string invalidChars = System.Text.RegularExpressions.Regex.Escape( new string( System.IO.Path.GetInvalidFileNameChars() ) );
string invalidRegStr = string.Format( @"([{0}]*\.+$)|([{0}]+)", invalidChars );
return System.Text.RegularExpressions.Regex.Replace( name, invalidRegStr, "_" );
}

A shorter solution:
var invalids = System.IO.Path.GetInvalidFileNameChars();
var newName = String.Join("_", origFileName.Split(invalids, StringSplitOptions.RemoveEmptyEntries) ).TrimEnd('.');
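One thing to keep in mind with any approach built on Path.GetInvalidFileNameChars(): the returned set can differ between platforms, so the same name may be sanitized differently depending on where the code runs. If the target is specifically Windows naming rules, the characters can be listed explicitly. This is a minimal sketch under that assumption; FileNameSanitizer is a hypothetical name, and reserved device names such as CON or NUL are not handled here.

```csharp
using System;
using System.Linq;

public static class FileNameSanitizer
{
    // Characters Windows rejects in file names (control characters are handled below).
    // Listed explicitly, as an assumption of this sketch, so results do not depend on the host OS.
    private static readonly char[] WindowsInvalid = { '\\', '/', ':', '*', '?', '"', '<', '>', '|' };

    public static string Sanitize(string name, char replacement = '_')
    {
        var chars = name
            .Select(c => (c < ' ' || WindowsInvalid.Contains(c)) ? replacement : c)
            .ToArray();
        // Names ending in a dot or space cause trouble on Windows, so trim them.
        return new string(chars).TrimEnd('.', ' ');
    }
}
```

Unlike the Split/Join version, this replaces each invalid character one for one, so the length of the name is preserved apart from the trailing trim.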

Based on Andre's excellent answer but taking into account Spud's comment on reserved words, I made this version:
/// <summary>
/// Strip illegal chars and reserved words from a candidate filename (should not include the directory path)
/// </summary>
/// <remarks>
/// http://stackoverflow.com/questions/309485/c-sharp-sanitize-file-name
/// </remarks>
public static string CoerceValidFileName(string filename)
{
var invalidChars = Regex.Escape(new string(Path.GetInvalidFileNameChars()));
var invalidReStr = string.Format(@"[{0}]+", invalidChars);
var reservedWords = new []
{
"CON", "PRN", "AUX", "CLOCK$", "NUL", "COM0", "COM1", "COM2", "COM3", "COM4",
"COM5", "COM6", "COM7", "COM8", "COM9", "LPT0", "LPT1", "LPT2", "LPT3", "LPT4",
"LPT5", "LPT6", "LPT7", "LPT8", "LPT9"
};
var sanitisedNamePart = Regex.Replace(filename, invalidReStr, "_");
foreach (var reservedWord in reservedWords)
{
var reservedWordPattern = string.Format("^{0}\\.", reservedWord);
sanitisedNamePart = Regex.Replace(sanitisedNamePart, reservedWordPattern, "_reservedWord_.", RegexOptions.IgnoreCase);
}
return sanitisedNamePart;
}
And these are my unit tests
[Test]
public void CoerceValidFileName_SimpleValid()
{
var filename = @"thisIsValid.txt";
var result = PathHelper.CoerceValidFileName(filename);
Assert.AreEqual(filename, result);
}
[Test]
public void CoerceValidFileName_SimpleInvalid()
{
var filename = @"thisIsNotValid\3\\_3.txt";
var result = PathHelper.CoerceValidFileName(filename);
Assert.AreEqual("thisIsNotValid_3__3.txt", result);
}
[Test]
public void CoerceValidFileName_InvalidExtension()
{
var filename = @"thisIsNotValid.t\xt";
var result = PathHelper.CoerceValidFileName(filename);
Assert.AreEqual("thisIsNotValid.t_xt", result);
}
[Test]
public void CoerceValidFileName_KeywordInvalid()
{
var filename = "aUx.txt";
var result = PathHelper.CoerceValidFileName(filename);
Assert.AreEqual("_reservedWord_.txt", result);
}
[Test]
public void CoerceValidFileName_KeywordValid()
{
var filename = "auxillary.txt";
var result = PathHelper.CoerceValidFileName(filename);
Assert.AreEqual("auxillary.txt", result);
}

string clean = String.Concat(dirty.Split(Path.GetInvalidFileNameChars()));

There are a lot of working solutions here. Just for the sake of completeness, here's an approach that doesn't use regex, but uses LINQ:
var invalids = Path.GetInvalidFileNameChars();
filename = invalids.Aggregate(filename, (current, c) => current.Replace(c, '_'));
Also, it's a very short solution ;)

I'm using the System.IO.Path.GetInvalidFileNameChars() method to check invalid characters and I've got no problems.
I'm using the following code:
foreach( char invalidchar in System.IO.Path.GetInvalidFileNameChars())
{
filename = filename.Replace(invalidchar, '_');
}

I wanted to retain the characters in some way, not just simply replace the character with an underscore.
One way I thought was to replace the characters with similar looking characters which are (in my situation), unlikely to be used as regular characters. So I took the list of invalid characters and found look-a-likes.
The following are functions to encode and decode with the look-a-likes.
This code does not include a complete listing for all System.IO.Path.GetInvalidFileNameChars() characters. So it is up to you to extend or utilize the underscore replacement for any remaining characters.
private static Dictionary<string, string> EncodeMapping()
{
//-- Following characters are invalid for windows file and folder names.
//-- \/:*?"<>|
Dictionary<string, string> dic = new Dictionary<string, string>();
dic.Add(@"\", "Ì"); // U+00CC
dic.Add("/", "Í"); // U+00CD
dic.Add(":", "¦"); // U+00A6
dic.Add("*", "¤"); // U+00A4
dic.Add("?", "¿"); // U+00BF
dic.Add(@"""", "ˮ"); // U+02EE
dic.Add("<", "«"); // U+00AB
dic.Add(">", "»"); // U+00BB
dic.Add("|", "│"); // U+2502
return dic;
}
public static string Escape(string name)
{
foreach (KeyValuePair<string, string> replace in EncodeMapping())
{
name = name.Replace(replace.Key, replace.Value);
}
//-- handle dot at the end
if (name.EndsWith(".")) name = name.Substring(0, name.Length - 1) + "°";
return name;
}
public static string UnEscape(string name)
{
foreach (KeyValuePair<string, string> replace in EncodeMapping())
{
name = name.Replace(replace.Value, replace.Key);
}
//-- handle dot at the end
if (name.EndsWith("°")) name = name.Substring(0, name.Length - 1) + ".";
return name;
}
You can select your own look-a-likes. I used the Character Map app in windows to select mine %windir%\system32\charmap.exe
As I make adjustments through discovery, I will update this code.

I think the problem is that you first call Path.GetDirectoryName on the bad string. If this has non-filename characters in it, .Net can't tell which parts of the string are directories and throws. You have to do string comparisons.
Assuming it's only the filename that is bad, not the entire path, try this:
public static string SanitizePath(string path, char replaceChar)
{
int filenamePos = path.LastIndexOf(Path.DirectorySeparatorChar) + 1;
var sb = new System.Text.StringBuilder();
sb.Append(path.Substring(0, filenamePos));
for (int i = filenamePos; i < path.Length; i++)
{
char filenameChar = path[i];
foreach (char c in Path.GetInvalidFileNameChars())
if (filenameChar.Equals(c))
{
filenameChar = replaceChar;
break;
}
sb.Append(filenameChar);
}
return sb.ToString();
}

I have had success with this in the past.
Nice, short and static :-)
public static string returnSafeString(string s)
{
foreach (char character in Path.GetInvalidFileNameChars())
{
s = s.Replace(character.ToString(),string.Empty);
}
foreach (char character in Path.GetInvalidPathChars())
{
s = s.Replace(character.ToString(), string.Empty);
}
return (s);
}

Here's an efficient lazy loading extension method based on Andre's code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace LT
{
public static class Utility
{
static string invalidRegStr;
public static string MakeValidFileName(this string name)
{
if (invalidRegStr == null)
{
var invalidChars = System.Text.RegularExpressions.Regex.Escape(new string(System.IO.Path.GetInvalidFileNameChars()));
invalidRegStr = string.Format(@"([{0}]*\.+$)|([{0}]+)", invalidChars);
}
return System.Text.RegularExpressions.Regex.Replace(name, invalidRegStr, "_");
}
}
}

Your code would be cleaner if you appended the directory and filename together and sanitized that rather than sanitizing them independently. As for sanitizing away the :, just take the 2nd character in the string. If it is equal to "replacechar", replace it with a colon. Since this app is for your own use, such a solution should be perfectly sufficient.

using System;
using System.IO;
using System.Linq;
using System.Text;
public class Program
{
public static void Main()
{
try
{
var badString = "ABC\\DEF/GHI<JKL>MNO:PQR\"STU\tVWX|YZA*BCD?EFG";
Console.WriteLine(badString);
Console.WriteLine(SanitizeFileName(badString, '.'));
Console.WriteLine(SanitizeFileName(badString));
}
catch (Exception ex)
{
Console.WriteLine(ex.ToString());
}
}
private static string SanitizeFileName(string fileName, char? replacement = null)
{
if (fileName == null) { return null; }
if (fileName.Length == 0) { return ""; }
var sb = new StringBuilder();
var badChars = Path.GetInvalidFileNameChars().ToList();
foreach (var @char in fileName)
{
if (badChars.Contains(@char))
{
if (replacement.HasValue)
{
sb.Append(replacement.Value);
}
continue;
}
sb.Append(@char);
}
return sb.ToString();
}
}

Based on @fiat's and @Andre's approaches, I'd like to share my solution too.
Main differences:
it's an extension method
the regex is compiled on first use to save some time when it is executed many times
reserved words are preserved
public static class StringPathExtensions
{
private static Regex _invalidPathPartsRegex;
static StringPathExtensions()
{
var invalidReg = System.Text.RegularExpressions.Regex.Escape(new string(Path.GetInvalidFileNameChars()));
_invalidPathPartsRegex = new Regex($"(?<reserved>^(CON|PRN|AUX|CLOCK\\$|NUL|COM0|COM1|COM2|COM3|COM4|COM5|COM6|COM7|COM8|COM9|LPT0|LPT1|LPT2|LPT3|LPT4|LPT5|LPT6|LPT7|LPT8|LPT9))|(?<invalid>[{invalidReg}:]+|\\.$)", RegexOptions.Compiled);
}
public static string SanitizeFileName(this string path)
{
return _invalidPathPartsRegex.Replace(path, m =>
{
if (!string.IsNullOrWhiteSpace(m.Groups["reserved"].Value))
return string.Concat("_", m.Groups["reserved"].Value);
return "_";
});
}
}
