Problems with using Regex but don't know why - c#

I am making a WPF application where you insert PDF files and it converts them to text. After that, a few Regex functions are run on the text to give me only the important parts of the PDF.
The first problem I am running into is with numbers: if the number is, for example, 6.90, it comes out as 6.9. I have tried changing my Regex, but it makes no difference.
The second problem is with dates: for example, 09-06-2022 just isn't written out at all. I have also tried changing the Regex, but it still doesn't show up.
Does anyone know why this is?
This is a line in the PDF I use; I am trying to get only 6.90:
Date: 06-09-2022 € 5.70 € 1.20 € 6.90
This is the Regex I use to get only the amount:
(?<=Date\:?\s?\s?\s?\d{0,2}\-\d{0,2}\-\d{0,4}\s?\€\s\d{0,10}\.?\,?\d{0,2}\s?\€\s\d{0,10}\,?\.?\d{0,10}\s?\€\s)\d{0,10}\.\d{0,2}
This is the Regex I use to get only the date:
(?<=Date\:?\s?\s?\s?)\d{0,2}\-\d{0,2}\-\d{0,4}
There are a lot of "?" in it because I have to make it compatible with multiple different PDFs.
screenshot of the outcome for the number in my selfmade Regex executor application
screenshot of the outcome for the date in my selfmade Regex executor application
screenshot of the outcome i get when i inserted a PDF
As you can see in the screenshots, for some reason I get different results, and I have no clue why they are different.
MainWindow
The button does all the work: receiving the PDF, converting it to text, and passing it through the correct class, where all the regexes are.
using Microsoft.Win32;
using System;
using System.Collections.Generic;
using System.IO;
using System.Windows;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
//ItextSharp is a tool i use in Visual Studio
public partial class MainWindow : Window
{
    private List<IRegexPDFFactuur> _listRegexFactuur = new List<IRegexPDFFactuur>();

    public MainWindow()
    {
        InitializeComponent();
    }

    public void btnUpload_Click(object sender, EventArgs e)
    {
        var openFileDialog = new OpenFileDialog();
        if (openFileDialog.ShowDialog() == true)
        {
            tbInvoer.Text = "";
            var file = openFileDialog.FileName;
            var text = File.ReadAllText(file);
            PdfReader pdf_Reader = new PdfReader(file);
            String tempPDFText = "";
            for (int i = 1; i <= pdf_Reader.NumberOfPages; i++)
            {
                tempPDFText = tempPDFText + PdfTextExtractor.GetTextFromPage(pdf_Reader, i);
            }
            var PDFText = tempPDFText;
            _listRegexFactuur.Add(new PDFtest1Type());
            foreach (var tempRegexFactuurType in _listRegexFactuur)
            {
                if (tempRegexFactuurType.IsRegexTypeValidForPDF(PDFText))
                {
                    var tempPDFdate = tempRegexFactuurType.GetPDFdate(PDFText);
                    var tempTotalamount = tempRegexFactuurType.GetTotalamount(PDFText);
                    tbInvoer.Text += $"PDF Date: {tempPDFdate}\r\n";
                    tbInvoer.Text += $"Total amount: {tempTotalamount}";
                    break;
                }
            }
        }
    }
}
Interface for Regex
string regexPDFname { get; set; }
string regexPDFdate { get; set; }
string regexTotalamount { get; set; }
bool IsRegexTypeValidForPDF(string argInput);
double? GetPDFdate(string argInput);
double? GetTotalamount(string argInput);
Class with implemented Interface for Regex
public string regexPDFname { get; set; } = @"(PDFtest1)";
public string regexPDFdate { get; set; } = @"(?<=Date\:?\s?\s?\s?)\d{0,2}\-\d{0,2}\-\d{0,4}";
public string regexTotalamount { get; set; } = @"(?<=Date\:?\s?\s?\s?\d{0,2}\-\d{0,2}\-\d{0,4}\s?\€\s\d{0,10}\.?\,?\d{0,2}\s?\€\s\d{0,10}\,?\.?\d{0,10}\s?\€\s)\d{0,10}\.\d{0,2}";
public bool IsRegexTypeValidForPDF(string argInput)
{
var tempMatch = Regex.Match(argInput, regexPDFname, RegexOptions.IgnoreCase);
if (!tempMatch.Success) return false;
if (tempMatch.Value == "PDFtest1") return true;
else return false;
}
public double? GetPDFdate(string argInput)
{
var tempMatch = Regex.Match(argInput, regexPDFdate, RegexOptions.IgnoreCase);
if (!tempMatch.Success) return null;
if (Double.TryParse(tempMatch.Value, out var tempPDFdate)) return tempPDFdate;
else return null;
}
public double? GetTotalamount(string argInput)
{
var tempMatch = Regex.Match(argInput, regexTotalamount, RegexOptions.IgnoreCase);
if (!tempMatch.Success) return null;
if (Double.TryParse(tempMatch.Value, out var tempTotalamount)) return tempTotalamount;
else return null;
}

This is much easier without Regex
string input = "Date: 06-09-2022 € 5.70 € 1.20 € 6.90";
string[] array = input.Split(new char[] {':', '€'});
DateTime date = DateTime.Parse(array[1]);
decimal amount1 = decimal.Parse(array[2]);
decimal amount2 = decimal.Parse(array[3]);
decimal amount3 = decimal.Parse(array[4]);

If you still want to use Regex, this is a much simpler solution
Date\:\s{0,}(\d{1,2}-?\d{1,2}-?\d{2}(?:\d{2})?).+(\d+\.\d+).+(\d+\.\d+).+(\d+\.\d+)
Breakdown
Date\:\s{0,} matches Date: followed by 0 or more spaces
(\d{1,2}-?\d{1,2}-?\d{2}(?:\d{2})?) matches your date string, accepting 1 or 2 digits for the day and month and 2 or 4 digits for the year
.+(\d+\.\d+) matches any characters until it matches 1 or more numbers followed by . and 1 or more numbers. This is repeated 3 times to obtain the currency values
RegEx Storm Example
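A likely cause of both symptoms in the question is the return type rather than the pattern: GetPDFdate and GetTotalamount both return double?. A double does not keep trailing zeros, so 6.90 prints as 6.9, and Double.TryParse cannot parse "06-09-2022", so the date method returns null. The following is only a sketch of one way around that, parsing the amount as decimal and the date as DateTime. The class name, method name and pattern are made up for illustration, and it assumes the date is day-month-year and that amounts always have two decimals.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class InvoiceLineParser
{
    // Hypothetical pattern for lines shaped like the sample in the question:
    // "Date: 06-09-2022 € 5.70 € 1.20 € 6.90"
    private static readonly Regex LinePattern = new Regex(
        @"Date:?\s*(?<date>\d{1,2}-\d{1,2}-\d{4})(?:\s*€\s*(?<amount>\d+[.,]\d{2}))+",
        RegexOptions.IgnoreCase);

    public static bool TryParseLine(string line, out DateTime date, out decimal total)
    {
        date = DateTime.MinValue;
        total = 0m;

        Match match = LinePattern.Match(line);
        if (!match.Success) return false;

        // Assumption: dates are written day-month-year.
        string[] formats = { "d-M-yyyy", "dd-MM-yyyy" };
        if (!DateTime.TryParseExact(match.Groups["date"].Value, formats,
                CultureInfo.InvariantCulture, DateTimeStyles.None, out date))
        {
            return false;
        }

        // .NET keeps every capture of a repeated group; the last one is the total.
        CaptureCollection amounts = match.Groups["amount"].Captures;
        string lastAmount = amounts[amounts.Count - 1].Value.Replace(',', '.');

        // decimal keeps the scale of the parsed text, so "6.90" stays 6.90.
        return decimal.TryParse(lastAmount, NumberStyles.Number,
            CultureInfo.InvariantCulture, out total);
    }
}
```

If the value only needs to be displayed, returning the matched string unchanged would also keep the trailing zero.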

Related

How to Get index of a Character in an Unknown Line of a Multiline string in c#

I'm trying to get COVID-19 results (only the information about Iran) from an API and show them in a textbox.
The full result (all countries) that I get from the API is in JSON format.
To get only the Iran section, I wrote a function that loops through the lines of the string one by one. It checks whether a line contains a "{" and, if so, stores its index; it then keeps checking until another line contains a "}" and stores that index too. If the text between the two contains "Iran", it adds that text (from "{" to "}") to a string:
private string getBetween(string strSourceText, string strStartingPosition, string strEndingPosition)
{
int Starting_CurlyBracket_Index = 0;
int Ending_CurlyBracket_Index = 0;
string FinalText = null;
bool isTurnTo_firstIf = true;
foreach (var line in strSourceText.Split('\r', '\n'))
{
if (isTurnTo_firstIf == true)
{
if (line.Contains(strStartingPosition))
{
Starting_CurlyBracket_Index = line.IndexOf(strStartingPosition); //i think problem is here
isTurnTo_firstIf = false;
}
}
else if (isTurnTo_firstIf == false)
{
if (line.Contains(strEndingPosition))
{
Ending_CurlyBracket_Index = line.IndexOf(strEndingPosition); //i think problem is here
if (strSourceText.Substring(Starting_CurlyBracket_Index, Ending_CurlyBracket_Index - Starting_CurlyBracket_Index).Contains("Iran")) //error here
{
FinalText = strSourceText.Substring(Starting_CurlyBracket_Index, Ending_CurlyBracket_Index - Starting_CurlyBracket_Index);
break;
}
else
{
isTurnTo_firstIf = true;
}
}
}
}
return FinalText;
}
and i call the function like this:
string OnlyIranSection = getBetween(Sorted_Covid19_Result, "{", "}"); //Sorted_Covid19_Result is the full result in json format that converted to string
textBox1.Text = OnlyIranSection;
but i get this Error:
I know why: it takes the indexes within the current line, but what I need is the index within strSourceText, so that I can show only this section of the whole result:
USING JSON
As the comments suggest, using a JSON library makes this much easier.
You can start with this basic example:
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json;

class Program
{
    static void Main(string[] args)
    {
        string jsonString = @"{
""results"": [
{""continent"":""Asia"",""country"":""Indonesia""},
{""continent"":""Asia"",""country"":""Iran""},
{""continent"":""Asia"",""country"":""Philippines""}
]
}";
        var result = JsonConvert.DeserializeObject<JsonResult>(jsonString);
        // Pick the first entry whose country is Iran (null if there is none)
        var iranInfo = result.InfoList.Where(i => i.Country.ToString() == "Iran").FirstOrDefault();
    }
}

public class JsonResult
{
    [JsonProperty("results")]
    public List<Info> InfoList { get; set; }
}

public class Info
{
    public object Continent { get; set; }
    public object Country { get; set; }
}
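If you would rather not add the Newtonsoft package, roughly the same lookup can be done with System.Text.Json, which ships with .NET Core 3.0 and later. This is only a sketch: CountryFilter and FindCountry are made-up names, and it assumes the response has a top-level "results" array like the example above, which may not match the real API.

```csharp
using System;
using System.Text.Json;

public static class CountryFilter
{
    // Returns the raw JSON text of the first entry in "results" whose
    // "country" equals the given name, or null when nothing matches.
    public static string FindCountry(string json, string country)
    {
        using (JsonDocument document = JsonDocument.Parse(json))
        {
            // Assumption: the root object has a "results" array.
            foreach (JsonElement entry in document.RootElement.GetProperty("results").EnumerateArray())
            {
                if (entry.ValueKind == JsonValueKind.Object
                    && entry.TryGetProperty("country", out JsonElement name)
                    && name.ValueKind == JsonValueKind.String
                    && string.Equals(name.GetString(), country, StringComparison.OrdinalIgnoreCase))
                {
                    return entry.GetRawText();
                }
            }
        }
        return null;
    }
}
```

Calling CountryFilter.FindCountry(jsonString, "Iran") would then give the text from "{" to "}" for that one entry, which can go straight into the textbox.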
UPDATE: USING INDEX
As long as the structure of the JSON is always consistent, this kind of sample solution can give you a hint.
Console.WriteLine("Original JSON:");
Console.WriteLine(jsonString);
Console.WriteLine();
Console.WriteLine("Step1: Make the json as single line,");
jsonString = jsonString.Replace(" ", "").Replace(Environment.NewLine, " ");
Console.WriteLine(jsonString);
Console.WriteLine();
Console.WriteLine("Step2: Get index of country Iran. And use that index to get the below output using substring.");
var iranIndex = jsonString.ToLower().IndexOf(@"""country"":""iran""");
var iranInitialInfo = jsonString.Substring(iranIndex);
Console.WriteLine(iranInitialInfo);
Console.WriteLine();
Console.WriteLine("Step3: Get index of continent. And use that index to get below output using substring.");
var continentIndex = iranInitialInfo.IndexOf(@"""continent"":");
iranInitialInfo = iranInitialInfo.Substring(0, continentIndex-3);
Console.WriteLine(iranInitialInfo);
Console.WriteLine();
Console.WriteLine("Step4: Get the first part of the info by using. And combine it with the initialInfo to bring the output below.");
var beginningIranInfo = jsonString.Substring(0, iranIndex);
var lastOpenCurlyBraceIndex = beginningIranInfo.LastIndexOf("{");
beginningIranInfo = beginningIranInfo.Substring(lastOpenCurlyBraceIndex);
var iranInfo = beginningIranInfo + iranInitialInfo;
Console.WriteLine(iranInfo);
OUTPUT USING INDEX:

Converting log file to CSV

I have to convert a (Squid Web Proxy Server) log file to CSV file, so that it can be loaded into powerpivot for analysis of queries.
So how should I start, any help would strongly be appreciated.
I've to use C# language for this task, log looks like the following:
Format: Timestamp Elapsed Client Action/Code Size Method URI Ident Hierarchy/From Content
1473546438.145 917 5.45.107.68 TCP_DENIED/403 4114 GET http://atlantis.pennergame.de/pet/ - NONE/- text/html
1473546439.111 3 146.148.96.13 TCP_DENIED/403 4604 POST http://mobiuas.ebay.com/services/mobile/v1/UserAuthenticationService - NONE/- text/html
1473546439.865 358 212.83.168.7 TCP_DENIED/403 3955 GET http://www.theshadehouse.com/left-sidebar-post/ - NONE/- text/html
1473546439.985 218 185.5.97.68 TCP_DENIED/403 3600 GET http://www.google.pl/search? - NONE/- text/html
1473546440.341 2 146.148.96.13 TCP_DENIED/403 4604 POST http://mobiuas.ebay.com/services/mobile/v1/UserAuthenticationService - NONE/- text/html
1473546440.840 403 115.29.46.240 TCP_DENIED/403 4430 POST http://et.airchina.com.cn/fhx/consumeRecord/getCardConsumeRecordList.htm - NONE/- text/html
1473546441.486 2 52.41.27.39 TCP_DENIED/403 3813 POST http://www.deezer.com/ajax/action.php - NONE/- text/html
1473546441.596 2 146.148.96.13 TCP_DENIED/403 4604 POST http://mobiuas.ebay.com/services/mobile/v1/UserAuthenticationService - NONE/- text/html
It is already close to a CSV, so read it line by line and clean each line up a little:
...
line = line
    .Replace("   ", " ")   // compress 3 spaces to 1
    .Replace("  ", " ")    // compress 2 spaces to 1
    .Replace("  ", " ")    // compress 2 spaces to 1, again
    .Replace(" ", "|")     // replace space by '|'
    .Replace(" - ", "|");  // replace - by '|'
You may want to tweak this for the fields like TCP_DENIED/403 .
This gives you a '|'-separated line, which is easy to convert to any separator you need. Or split it up:
// write it out or process it further
string[] parts = line.Split('|');
public static class SquidWebProxyServerCommaSeparatedWriter
{
public static void WriteToCSV(string destination, IEnumerable<SquidWebProxyServerLogEntry> serverLogEntries)
{
var lines = serverLogEntries.Select(ConvertToLine);
File.WriteAllLines(destination, lines);
}
private static string ConvertToLine(SquidWebProxyServerLogEntry serverLogEntry)
{
return string.Join(@",", serverLogEntry.Timestamp, serverLogEntry.Elapsed.ToString(),
serverLogEntry.ClientIPAddress, serverLogEntry.ActionCode, serverLogEntry.Size.ToString(),
serverLogEntry.Method.ToString(), serverLogEntry.Uri, serverLogEntry.Identity,
serverLogEntry.HierarchyFrom, serverLogEntry.MimeType);
}
}
public static class SquidWebProxyServerLogParser
{
public static IEnumerable<SquidWebProxyServerLogEntry> Parse(FileInfo fileInfo)
{
using (var streamReader = fileInfo.OpenText())
{
string row;
while ((row = streamReader.ReadLine()) != null)
{
yield return ParseRow(row);
}
}
}
private static SquidWebProxyServerLogEntry ParseRow(string row)
{
var fields = row.Split(new[] {"\t", " "}, StringSplitOptions.RemoveEmptyEntries); // skip empty fields produced by runs of spaces
return new SquidWebProxyServerLogEntry
{
Timestamp = fields[0],
Elapsed = int.Parse(fields[1]),
ClientIPAddress = fields[2],
ActionCode = fields[3],
Size = int.Parse(fields[4]),
Method =
(SquidWebProxyServerLogEntry.MethodType)
Enum.Parse(typeof(SquidWebProxyServerLogEntry.MethodType), fields[5], true), // ignore case: the log has GET, the enum has Get
Uri = fields[6],
Identity = fields[7],
HierarchyFrom = fields[8],
MimeType = fields[9]
};
}
public static IEnumerable<SquidWebProxyServerLogEntry> Parse(IEnumerable<string> rows) => rows.Select(ParseRow);
}
public sealed class SquidWebProxyServerLogEntry
{
public enum MethodType
{
Get = 0,
Post = 1,
Put = 2
}
public string Timestamp { get; set; }
public int Elapsed { get; set; }
public string ClientIPAddress { get; set; }
public string ActionCode { get; set; }
public int Size { get; set; }
public MethodType Method { get; set; }
public string Uri { get; set; }
public string Identity { get; set; }
public string HierarchyFrom { get; set; }
public string MimeType { get; set; }
}
A CSV is a delimited file whose field delimiter is ,. Almost all programs allow you to specify different field and record delimiters, using , and \n as defaults.
Your file could be treated as delimited if it didn't contain multiple spaces for indentation. You can replace multiple spaces with a single one using the regex \s{2,}, eg:
var regex=new Regex(@"\s{2,}");
var original=File.ReadAllText(somePath);
var delimited=regex.Replace(original," ");
File.WriteAllText(somePath,delimited);
Power BI Desktop already allows you to use space as a delimiter. Even if it didn't, you could just replace all spaces with a comma by changing the pattern to \s+, ie:
var regex=new Regex(@"\s+");
...
var delimited=regex.Replace(original,",");
...
Log files are large, so it's a very good idea to reduce the amount of memory they use. You can avoid reading the entire file in memory if you use ReadLines to read one line at a time, make the replacement and write it out:
using(var writer=File.CreateText(targetPath))
{
foreach(var line in File.ReadLines(somePath))
{
var newline=regex.Replace(line," ");
writer.WriteLine(newline);
}
}
Unlike ReadAllLines which loads all lines in an array, ReadLines is an iterator that reads and returns one line at a time.
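One thing a plain space-to-comma replacement does not cover is a field that itself contains a comma, which can happen in URLs with query strings. A rough sketch of a streaming conversion that quotes such fields is below. SquidLogToCsv, ConvertLine and ConvertFile are hypothetical names, and it assumes no field in the log contains whitespace.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class SquidLogToCsv
{
    // Splits one log line on runs of whitespace and joins the fields with commas.
    public static string ConvertLine(string line)
    {
        // A null separator array means "split on any whitespace".
        string[] fields = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(",", fields.Select(field => Quote(field)));
    }

    // Wraps a field in double quotes when it contains a comma or a quote,
    // doubling any embedded quotes, as most CSV readers expect.
    private static string Quote(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"' }) < 0) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Streams the source file line by line, so the whole log is never in memory.
    public static void ConvertFile(string sourcePath, string targetPath)
    {
        IEnumerable<string> csvLines = File.ReadLines(sourcePath)
            .Where(line => !string.IsNullOrWhiteSpace(line))
            .Select(line => ConvertLine(line));
        File.WriteAllLines(targetPath, csvLines);
    }
}
```

ConvertFile reads with File.ReadLines and writes with the IEnumerable overload of File.WriteAllLines, so lines are processed one at a time as described above.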

Regex get group block with specific start and end each group

If we had some string like :
----------DBVer=1
/*some sql script*/
----------DBVer=1
----------DBVer=2
/*some sql script*/
----------DBVer=2
----------DBVer=n
/*some sql script*/
----------DBVer=n
Can we extract the scripts between the first DBVer=1 and the second DBVer=1, and so on, with regex?
I think we need some kind of placeholder in the regex that tells the engine: if you see DBVer=digitA, pick up the string until DBVer=digitA appears again; if you see DBVer=digitB, pick up the string until DBVer=digitB appears again; and so on.
Can we implement this with regex, and if so, how?
Yes, using backreferences and lookarounds, you can capture the scripts:
var pattern = @"(?<=(?<m>-{10}DBVer=\d+)\r?\n).*(?=\r?\n\k<m>)";
var scripts = Regex.Matches(input, pattern, RegexOptions.Singleline)
.Cast<Match>()
.Select(m => m.Value);
Here, we capture the m (marker) group with (?<m>-{10}DBVer=\d+) and reuse the m value later in the regex with \k<m> to match against the end marker.
In order for .* to match newline chars, it is necessary to turn on Singleline mode. This, in turn, means we have to be specific about our newlines. In Singleline mode, these can be accounted for in a non-platform specific way with \r?\n.
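As a variation on the same backreference idea, here is a sketch that also captures the version number and anchors the markers to whole lines, so that a closing DBVer=1 is not confused with the start of DBVer=10. It uses a lazy .*? instead of lookarounds. VersionedScripts and Extract are made-up names, and it assumes each marker sits on a line of its own with at least one script line between the two markers.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class VersionedScripts
{
    // Opening marker line, script body (lazy), then the same marker again
    // via \k<marker>, which must fill the rest of its line.
    private static readonly Regex BlockPattern = new Regex(
        @"^(?<marker>-{10}DBVer=(?<version>\d+))[ \t]*\r?\n(?<script>.*?)\r?\n\k<marker>[ \t]*\r?$",
        RegexOptions.Singleline | RegexOptions.Multiline);

    // Returns the script text keyed by its DBVer number.
    public static Dictionary<int, string> Extract(string input)
    {
        var scripts = new Dictionary<int, string>();
        foreach (Match match in BlockPattern.Matches(input))
        {
            int version = int.Parse(match.Groups["version"].Value);
            scripts[version] = match.Groups["script"].Value;
        }
        return scripts;
    }
}
```

Multiline makes ^ and $ match at line boundaries, while Singleline still lets .*? run across the newlines inside a script.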
Try code below. Not RegEx but works very well.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Text.RegularExpressions;
namespace ConsoleApplication6
{
class Program
{
const string FILENAME = @"c:\temp\test.txt";
static void Main(string[] args)
{
Script.ReadScripts(FILENAME);
}
}
public class Script
{
enum State
{
Get_Script,
Read_Script
}
public static List<Script> scripts = new List<Script>();
public int version { get; set; }
public string script { get; set; }
public static void ReadScripts(string filename)
{
string inputLine = "";
string pattern = "DBVer=(?'version'\\d+)";
State state = State.Get_Script;
StreamReader reader = new StreamReader(filename);
Script newScript = null;
while ((inputLine = reader.ReadLine()) != null)
{
inputLine = inputLine.Trim();
if (inputLine.Length > 0)
{
switch (state)
{
case State.Get_Script :
if(inputLine.StartsWith("-----"))
{
newScript = new Script();
scripts.Add(newScript);
string version =
Regex.Match(inputLine, pattern).Groups["version"].Value;
newScript.version = int.Parse(version);
newScript.script = "";
state = State.Read_Script;
}
break;
case State.Read_Script :
if (inputLine.StartsWith("-----"))
{
state = State.Get_Script;
}
else
{
if (newScript.script.Length == 0)
{
newScript.script = inputLine;
}
else
{
newScript.script += "\n" + inputLine;
}
}
break;
}
}
}
}
}
}

Searching Specific Data From a File

I have a file containing text and a few numbers. I just want to extract the numbers from it. How do I go about it?
I tried using Split, but no luck so far.
My File is like this:
AT+CMGL="ALL"
+CMGL: 5566,"REC READ","Ufone"
Dear customer, your DAY_BUCKET subscription will expire on 02/05/09
+CMGL: 5565,"REC READ","+923466666666"
Kindly tell me how to extract numbers like +923466666666 from this file so I can put them into another file or a textbox.
Thanks
Here's an example using the String.Split. The "number" contains a '+', so really it should be treated as a string not a number. I'm presuming it's a telephone number with the '+' potentially used for international calls? If it is a telephone number, you need to be careful of dashes, spaces in the number as well as extension numbers added to the end eg "+9234 666-66666 ext 235" and so on...
Anyway - hopefully the example is useful in getting to grips with Split.
The code includes unit tests using NUnit v2.4.8
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using NUnit.Framework;
using System.Text.RegularExpressions;
namespace SO.NumberExtractor.Test
{
public class NumberExtracter
{
public List<string> ExtractNumbers(string lines)
{
List<string> numbers = new List<string>();
string[] seperator = { System.Environment.NewLine };
string[] seperatedLines = lines.Split(seperator, StringSplitOptions.RemoveEmptyEntries);
foreach (string line in seperatedLines)
{
string s = ExtractNumber(line);
numbers.Add(s);
}
return numbers;
}
public string ExtractNumber(string line)
{
string s = line.Split(',').Last<string>().Trim('"');
return s;
}
public string ExtractNumberWithoutLinq(string line)
{
string[] fields = line.Split(',');
string s = fields[fields.Length - 1];
s = s.Trim('"');
return s;
}
}
[TestFixture]
public class NumberExtracterTest
{
private readonly string LINE1 = "AT+CMGL=\"ALL\" +CMGL: 5566,\"REC READ\",\"Ufone\" Dear customer, your DAY_BUCKET subscription will expire on 02/05/09 +CMGL: 5565,\"REC READ\",\"+923466666666\"";
private readonly string LINE2 = "AT+CMGL=\"ALL\" +CMGL: 5566,\"REC READ\",\"Ufone\" Dear customer, your DAY_BUCKET subscription will expire on 02/05/09 +CMGL: 5565,\"REC READ\",\"+923466666667\"";
private readonly string LINE3 = "AT+CMGL=\"ALL\" +CMGL: 5566,\"REC READ\",\"Ufone\" Dear customer, your DAY_BUCKET subscription will expire on 02/05/09 +CMGL: 5565,\"REC READ\",\"+923466666668\"";
[Test]
public void ExtractOneLineWithoutLinq()
{
string expected = "+923466666666";
NumberExtracter c = new NumberExtracter();
string result = c.ExtractNumberWithoutLinq(LINE1);
Assert.AreEqual(expected, result);
}
[Test]
public void ExtractOneLineUsingLinq()
{
string expected = "+923466666666";
NumberExtracter c = new NumberExtracter();
string result = c.ExtractNumber(LINE1);
Assert.AreEqual(expected, result);
}
[Test]
public void ExtractMultipleLines()
{
StringBuilder sb = new StringBuilder();
sb.AppendLine(LINE1);
sb.AppendLine(LINE2);
sb.AppendLine(LINE3);
NumberExtracter ne = new NumberExtracter();
List<string> extractedNumbers = ne.ExtractNumbers(sb.ToString());
string expectedFirst = "+923466666666";
string expectedSecond = "+923466666667";
string expectedThird = "+923466666668";
Assert.AreEqual(expectedFirst, extractedNumbers[0]);
Assert.AreEqual(expectedSecond, extractedNumbers[1]);
Assert.AreEqual(expectedThird, extractedNumbers[2]);
}
}
}
If the numbers are all at the end of the lines then you can use code like the following
foreach ( string line in File.ReadAllLines(@"c:\path\to\file.txt") ) {
    Match result = Regex.Match(line, @"\+(\d+)""$");
    if ( result.Success ) {
        var number = result.Groups[1].Value;
        // do what you want with the number
    }
}
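If the numbers are not guaranteed to be at the end of a line, a variation is to search the whole text for quoted values that start with a '+'. The sketch below keeps the '+' and returns strings rather than numeric types. PhoneNumberExtractor is a made-up name, and the 7 to 15 digit range is an assumption (15 is the E.164 maximum), so adjust it to your data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PhoneNumberExtractor
{
    // A '+' followed by 7 to 15 digits, wrapped in double quotes
    // as in the sample modem output. The digit range is an assumption.
    private static readonly Regex QuotedNumber = new Regex(@"""(\+\d{7,15})""");

    // Returns every quoted +number found anywhere in the text, '+' included.
    public static List<string> Extract(string text)
    {
        return QuotedNumber.Matches(text)
            .Cast<Match>()
            .Select(match => match.Groups[1].Value)
            .ToList();
    }
}
```

Something like PhoneNumberExtractor.Extract(File.ReadAllText(path)) would then give a list that can be written to another file or joined into a textbox.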
How large is the file? If the file is under a few megabytes in size I would recommend loading the file contents into a string and using a compiled regular expression to extract matches.
Here's a quick example:
Regex NumberExtractor = new Regex("[0-9]{7,16}",RegexOptions.Compiled);
/// <summary>
/// Extracts numbers between seven and sixteen digits long from the target file.
/// Example number to be extracted: +923466666666
/// </summary>
/// <param name="TargetFilePath"></param>
/// <returns>List of the matching numbers</returns>
private IEnumerable<ulong> ExtractLongNumbersFromFile(string TargetFilePath)
{
if (String.IsNullOrEmpty(TargetFilePath))
throw new ArgumentException("TargetFilePath is null or empty.", "TargetFilePath");
if (File.Exists(TargetFilePath) == false)
throw new Exception("Target file does not exist!");
FileStream TargetFileStream = null;
StreamReader TargetFileStreamReader = null;
string FileContents = "";
List<ulong> ReturnList = new List<ulong>();
try
{
TargetFileStream = new FileStream(TargetFilePath, FileMode.Open);
TargetFileStreamReader = new StreamReader(TargetFileStream);
FileContents = TargetFileStreamReader.ReadToEnd();
MatchCollection Matches = NumberExtractor.Matches(FileContents);
foreach (Match CurrentMatch in Matches) {
ReturnList.Add(System.Convert.ToUInt64(CurrentMatch.Value));
}
}
catch (Exception ex)
{
//Your logging, etc...
}
finally
{
if (TargetFileStream != null) {
TargetFileStream.Close();
TargetFileStream.Dispose();
}
if (TargetFileStreamReader != null)
{
TargetFileStreamReader.Dispose();
}
}
return (IEnumerable<ulong>)ReturnList;
}
Sample Usage:
List<ulong> Numbers = (List<ulong>)ExtractLongNumbersFromFile(@"v:\TestExtract.txt");

C# Sanitize File Name

I recently have been moving a bunch of MP3s from various locations into a repository. I had been constructing the new file names using the ID3 tags (thanks, TagLib-Sharp!), and I noticed that I was getting a System.NotSupportedException:
"The given path's format is not supported."
This was generated by either File.Copy() or Directory.CreateDirectory().
It didn't take long to realize that my file names needed to be sanitized. So I did the obvious thing:
public static string SanitizePath_(string path, char replaceChar)
{
string dir = Path.GetDirectoryName(path);
foreach (char c in Path.GetInvalidPathChars())
dir = dir.Replace(c, replaceChar);
string name = Path.GetFileName(path);
foreach (char c in Path.GetInvalidFileNameChars())
name = name.Replace(c, replaceChar);
return dir + name;
}
To my surprise, I continued to get exceptions. It turned out that ':' is not in the set of Path.GetInvalidPathChars(), because it is valid in a path root. I suppose that makes sense, but this has to be a pretty common problem. Does anyone have some short code that sanitizes a path? The most thorough I've come up with is this, but it feels like it is probably overkill.
// replaces invalid characters with replaceChar
public static string SanitizePath(string path, char replaceChar)
{
// construct a list of characters that can't show up in filenames.
// need to do this because ":" is not in InvalidPathChars
if (_BadChars == null)
{
_BadChars = new List<char>(Path.GetInvalidFileNameChars());
_BadChars.AddRange(Path.GetInvalidPathChars());
_BadChars = Utility.GetUnique<char>(_BadChars);
}
// remove root
string root = Path.GetPathRoot(path);
path = path.Remove(0, root.Length);
// split on the directory separator character. Need to do this
// because the separator is not valid in a filename.
List<string> parts = new List<string>(path.Split(new char[]{Path.DirectorySeparatorChar}));
// check each part to make sure it is valid.
for (int i = 0; i < parts.Count; i++)
{
string part = parts[i];
foreach (char c in _BadChars)
{
part = part.Replace(c, replaceChar);
}
parts[i] = part;
}
return root + Utility.Join(parts, Path.DirectorySeparatorChar.ToString());
}
Any improvements to make this function faster and less baroque would be much appreciated.
To clean up a file name you could do this
private static string MakeValidFileName( string name )
{
string invalidChars = System.Text.RegularExpressions.Regex.Escape( new string( System.IO.Path.GetInvalidFileNameChars() ) );
string invalidRegStr = string.Format( @"([{0}]*\.+$)|([{0}]+)", invalidChars );
return System.Text.RegularExpressions.Regex.Replace( name, invalidRegStr, "_" );
}
A shorter solution:
var invalids = System.IO.Path.GetInvalidFileNameChars();
var newName = String.Join("_", origFileName.Split(invalids, StringSplitOptions.RemoveEmptyEntries) ).TrimEnd('.');
Based on Andre's excellent answer but taking into account Spud's comment on reserved words, I made this version:
/// <summary>
/// Strip illegal chars and reserved words from a candidate filename (should not include the directory path)
/// </summary>
/// <remarks>
/// http://stackoverflow.com/questions/309485/c-sharp-sanitize-file-name
/// </remarks>
public static string CoerceValidFileName(string filename)
{
var invalidChars = Regex.Escape(new string(Path.GetInvalidFileNameChars()));
var invalidReStr = string.Format(@"[{0}]+", invalidChars);
var reservedWords = new []
{
"CON", "PRN", "AUX", "CLOCK$", "NUL", "COM0", "COM1", "COM2", "COM3", "COM4",
"COM5", "COM6", "COM7", "COM8", "COM9", "LPT0", "LPT1", "LPT2", "LPT3", "LPT4",
"LPT5", "LPT6", "LPT7", "LPT8", "LPT9"
};
var sanitisedNamePart = Regex.Replace(filename, invalidReStr, "_");
foreach (var reservedWord in reservedWords)
{
var reservedWordPattern = string.Format("^{0}\\.", reservedWord);
sanitisedNamePart = Regex.Replace(sanitisedNamePart, reservedWordPattern, "_reservedWord_.", RegexOptions.IgnoreCase);
}
return sanitisedNamePart;
}
And these are my unit tests
[Test]
public void CoerceValidFileName_SimpleValid()
{
var filename = @"thisIsValid.txt";
var result = PathHelper.CoerceValidFileName(filename);
Assert.AreEqual(filename, result);
}
[Test]
public void CoerceValidFileName_SimpleInvalid()
{
var filename = @"thisIsNotValid\3\\_3.txt";
var result = PathHelper.CoerceValidFileName(filename);
Assert.AreEqual("thisIsNotValid_3__3.txt", result);
}
[Test]
public void CoerceValidFileName_InvalidExtension()
{
var filename = @"thisIsNotValid.t\xt";
var result = PathHelper.CoerceValidFileName(filename);
Assert.AreEqual("thisIsNotValid.t_xt", result);
}
[Test]
public void CoerceValidFileName_KeywordInvalid()
{
var filename = "aUx.txt";
var result = PathHelper.CoerceValidFileName(filename);
Assert.AreEqual("_reservedWord_.txt", result);
}
[Test]
public void CoerceValidFileName_KeywordValid()
{
var filename = "auxillary.txt";
var result = PathHelper.CoerceValidFileName(filename);
Assert.AreEqual("auxillary.txt", result);
}
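One case the tests above do not cover is a reserved name with no extension at all, such as a file called just "AUX", since the pattern only matches the word followed by a dot. A small sketch of a check that handles both cases is below. ReservedNames, IsReserved and Prefix are hypothetical names, and the list is the usual set of Windows device names rather than anything taken from the code above.

```csharp
using System;
using System.Collections.Generic;

public static class ReservedNames
{
    // Windows device names; the comparison is case-insensitive.
    private static readonly HashSet<string> DeviceNames = new HashSet<string>(
        StringComparer.OrdinalIgnoreCase)
    {
        "CON", "PRN", "AUX", "NUL",
        "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
        "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9"
    };

    // True when the part before the first dot is a device name,
    // so "AUX", "aux.txt" and "NUL.tar.gz" all count as reserved.
    public static bool IsReserved(string fileName)
    {
        int dot = fileName.IndexOf('.');
        string baseName = dot < 0 ? fileName : fileName.Substring(0, dot);
        return DeviceNames.Contains(baseName.TrimEnd(' '));
    }

    // Prefixes reserved names with an underscore and leaves other names alone.
    public static string Prefix(string fileName)
    {
        return IsReserved(fileName) ? "_" + fileName : fileName;
    }
}
```

This only looks at the name itself; stripping invalid characters would still be done separately, for example with one of the approaches on this page.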
string clean = String.Concat(dirty.Split(Path.GetInvalidFileNameChars()));
There are a lot of working solutions here. Just for the sake of completeness, here's an approach that doesn't use regex, but uses LINQ:
var invalids = Path.GetInvalidFileNameChars();
filename = invalids.Aggregate(filename, (current, c) => current.Replace(c, '_'));
Also, it's a very short solution ;)
I'm using the System.IO.Path.GetInvalidFileNameChars() method to check invalid characters and I've got no problems.
I'm using the following code:
foreach( char invalidchar in System.IO.Path.GetInvalidFileNameChars())
{
filename = filename.Replace(invalidchar, '_');
}
I wanted to retain the characters in some way, not just simply replace the character with an underscore.
One way I thought was to replace the characters with similar looking characters which are (in my situation), unlikely to be used as regular characters. So I took the list of invalid characters and found look-a-likes.
The following are functions to encode and decode with the look-a-likes.
This code does not include a complete listing for all System.IO.Path.GetInvalidFileNameChars() characters. So it is up to you to extend or utilize the underscore replacement for any remaining characters.
private static Dictionary<string, string> EncodeMapping()
{
//-- Following characters are invalid for windows file and folder names.
//-- \/:*?"<>|
Dictionary<string, string> dic = new Dictionary<string, string>();
dic.Add(@"\", "Ì"); // U+00CC
dic.Add("/", "Í"); // U+00CD
dic.Add(":", "¦"); // U+00A6
dic.Add("*", "¤"); // U+00A4
dic.Add("?", "¿"); // U+00BF
dic.Add(@"""", "ˮ"); // U+02EE
dic.Add("<", "«"); // U+00AB
dic.Add(">", "»"); // U+00BB
dic.Add("|", "│"); // U+2502
return dic;
}
public static string Escape(string name)
{
foreach (KeyValuePair<string, string> replace in EncodeMapping())
{
name = name.Replace(replace.Key, replace.Value);
}
//-- handle dot at the end
if (name.EndsWith(".")) name = name.Substring(0, name.Length - 1) + "°";
return name;
}
public static string UnEscape(string name)
{
foreach (KeyValuePair<string, string> replace in EncodeMapping())
{
name = name.Replace(replace.Value, replace.Key);
}
//-- handle dot at the end
if (name.EndsWith("°")) name = name.Substring(0, name.Length - 1) + ".";
return name;
}
You can select your own look-a-likes. I used the Character Map app in windows to select mine %windir%\system32\charmap.exe
As I make adjustments through discovery, I will update this code.
I think the problem is that you first call Path.GetDirectoryName on the bad string. If this has non-filename characters in it, .Net can't tell which parts of the string are directories and throws. You have to do string comparisons.
Assuming it's only the filename that is bad, not the entire path, try this:
public static string SanitizePath(string path, char replaceChar)
{
int filenamePos = path.LastIndexOf(Path.DirectorySeparatorChar) + 1;
var sb = new System.Text.StringBuilder();
sb.Append(path.Substring(0, filenamePos));
for (int i = filenamePos; i < path.Length; i++)
{
char filenameChar = path[i];
foreach (char c in Path.GetInvalidFileNameChars())
if (filenameChar.Equals(c))
{
filenameChar = replaceChar;
break;
}
sb.Append(filenameChar);
}
return sb.ToString();
}
I have had success with this in the past.
Nice, short and static :-)
public static string returnSafeString(string s)
{
foreach (char character in Path.GetInvalidFileNameChars())
{
s = s.Replace(character.ToString(),string.Empty);
}
foreach (char character in Path.GetInvalidPathChars())
{
s = s.Replace(character.ToString(), string.Empty);
}
return (s);
}
Here's an efficient lazy loading extension method based on Andre's code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace LT
{
public static class Utility
{
static string invalidRegStr;
public static string MakeValidFileName(this string name)
{
if (invalidRegStr == null)
{
var invalidChars = System.Text.RegularExpressions.Regex.Escape(new string(System.IO.Path.GetInvalidFileNameChars()));
invalidRegStr = string.Format(@"([{0}]*\.+$)|([{0}]+)", invalidChars);
}
return System.Text.RegularExpressions.Regex.Replace(name, invalidRegStr, "_");
}
}
}
Your code would be cleaner if you appended the directory and filename together and sanitized that rather than sanitizing them independently. As for sanitizing away the :, just take the 2nd character in the string. If it is equal to "replacechar", replace it with a colon. Since this app is for your own use, such a solution should be perfectly sufficient.
using System;
using System.IO;
using System.Linq;
using System.Text;
public class Program
{
public static void Main()
{
try
{
var badString = "ABC\\DEF/GHI<JKL>MNO:PQR\"STU\tVWX|YZA*BCD?EFG";
Console.WriteLine(badString);
Console.WriteLine(SanitizeFileName(badString, '.'));
Console.WriteLine(SanitizeFileName(badString));
}
catch (Exception ex)
{
Console.WriteLine(ex.ToString());
}
}
private static string SanitizeFileName(string fileName, char? replacement = null)
{
if (fileName == null) { return null; }
if (fileName.Length == 0) { return ""; }
var sb = new StringBuilder();
var badChars = Path.GetInvalidFileNameChars().ToList();
foreach (var @char in fileName)
{
if (badChars.Contains(@char))
{
if (replacement.HasValue)
{
sb.Append(replacement.Value);
}
continue;
}
sb.Append(@char);
}
return sb.ToString();
}
}
Based on @fiat's and @Andre's approaches, I'd like to share my solution too.
Main differences:
it's an extension method
the regex is compiled on first use, to save time when it is executed many times
reserved words are preserved (prefixed with an underscore)
public static class StringPathExtensions
{
private static Regex _invalidPathPartsRegex;
static StringPathExtensions()
{
var invalidReg = System.Text.RegularExpressions.Regex.Escape(new string(Path.GetInvalidFileNameChars()));
_invalidPathPartsRegex = new Regex($"(?<reserved>^(CON|PRN|AUX|CLOCK\\$|NUL|COM0|COM1|COM2|COM3|COM4|COM5|COM6|COM7|COM8|COM9|LPT0|LPT1|LPT2|LPT3|LPT4|LPT5|LPT6|LPT7|LPT8|LPT9))|(?<invalid>[{invalidReg}:]+|\\.$)", RegexOptions.Compiled);
}
public static string SanitizeFileName(this string path)
{
return _invalidPathPartsRegex.Replace(path, m =>
{
if (!string.IsNullOrWhiteSpace(m.Groups["reserved"].Value))
return string.Concat("_", m.Groups["reserved"].Value);
return "_";
});
}
}
