Remove Non-ASCII characters from XML file C#


I am trying to write a program that opens an XML file containing non-ASCII characters, replaces those characters with spaces, and then saves and closes the file.
That's basically it: open the file, remove all the non-ASCII characters, and save/close the file.
Here is my code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.IO;
using System.Text.RegularExpressions;
namespace RemoveSpecial
{
class Program
{
static void Main(string[] args)
{
string pth_input = string.Empty;
string pth_output = string.Empty;
for (int i = 1; i < args.Length; i++)
{
//input one
string p_input = args[0];
pth_input = p_input;
pth_input = pth_input.Replace(@"\", @"\\");
//output
string p_output = args[2];
pth_output = p_output;
pth_output = pth_output.Replace(@"\", @"\\");
}
//s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);
string lx;
using (StreamReader sr = new StreamReader(pth_input))
{
using (StreamWriter x = new StreamWriter(pth_output))
{
while ((lx = sr.ReadLine()) != null)
{
string text = sr.ReadToEnd();
Regex.Replace(text, @"[^\u0000-\u007F]+", "", RegexOptions.Compiled);
x.Write(text);
} sr.Close();
}
}
}
}
}
Thanks in advance guys.

According to the documentation, the first string is an input parameter (and it is not passed by reference, so it could not change anyway). The result of the replacement is the return value, like so:
var result = Regex.Replace(text, @"[^\u0000-\u007F]+", "", RegexOptions.Compiled);
x.Write(result);
Note that RegexOptions.Compiled might decrease performance here. It pays off only if you reuse the same regular expression instance on multiple strings. You can still do that if you create the Regex instance outside the loop:
var regex = new Regex(@"[^\u0000-\u007F]+", RegexOptions.Compiled);
using (var sr = new StreamReader(pth_input))
{
using (var x = new StreamWriter(pth_output))
{
string line;
// Process the file line by line. Calling ReadToEnd() inside the loop would
// silently drop the line that ReadLine() had just consumed.
while ((line = sr.ReadLine()) != null)
{
x.WriteLine(regex.Replace(line, String.Empty));
}
}
}
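The question asks for spaces rather than deletion. A minimal sketch of that variant, assuming one space per non-ASCII UTF-16 code unit is acceptable (a character outside the Basic Multilingual Plane would therefore become two spaces). The names AsciiCleaner and ReplaceNonAscii are made up for illustration:

```csharp
using System.Text.RegularExpressions;

static class AsciiCleaner
{
    // Matches any single UTF-16 code unit outside the 7-bit ASCII range.
    private static readonly Regex NonAscii = new Regex(@"[^\u0000-\u007F]");

    // Returns a copy of the input with every non-ASCII code unit replaced by one space.
    // Strings are immutable, so the result must be taken from the return value.
    public static string ReplaceNonAscii(string text)
    {
        return NonAscii.Replace(text, " ");
    }
}
```

For a small file this could be applied in one go, for example File.WriteAllText(pth_output, AsciiCleaner.ReplaceNonAscii(File.ReadAllText(pth_input))).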

Related

Using C# how can I split a text file into multiple files

How can I split a text file that contains the ASCII codes SOH and ETX into multiple files?
For example, the text file I have, named 001234.txt, contains the following content:
SOH{ABCDXZY}ETX
SOH{ABCDXZY}ETX
SOH{ABCDXZY}ETX
I would like to split the single text file into multiple text files, one for each block that starts with SOH and ends with ETX.
The single text file should be split into 101234.txt, 111234.txt, etc., each containing a single block that starts with SOH and ends with ETX.
I appreciate any help.
using System.IO;
using System.Linq;
namespace ASCII_Split
{
class Program
{
static void Main(string[] args)
{
var txt = "";
const char soh = (char)1;
const char eox = (char)3;
var count = 1;
var pathToFile = @"C:\Temp\00599060.txt";
using (var sr = new StreamReader(pathToFile))
txt = sr.ReadToEnd();
while (txt.Contains(soh))
{
var outfil = Path.Combine(Path.GetDirectoryName(pathToFile), count.ToString("000"), "_fix.txt");
var eInd = txt.IndexOf(eox);
using (var sw = new StreamWriter(outfil, false))
{
sw.Write(txt.Substring(1, eInd - 1));
}
txt = txt.Substring(eInd + 1);
count++;
}
}
}
}

This should more or less do the trick, assuming SOH and ETX appear as literal text in the file:
//Read all text from file into a string
var fileContent = File.ReadAllText("001234.txt");
//match each block that starts with SOH and ends with ETX (non-greedy)
var pattern = @"SOH.*?ETX";
var matches = Regex.Matches(fileContent, pattern, RegexOptions.Singleline);
//counter for file names
var counter = 10;
foreach (Match match in matches)
{
//write each block to its own file: 101234.txt, 111234.txt, ...
File.WriteAllText($"{counter++}1234.txt", match.Value, new UTF8Encoding(true));
}

Provided that by SOH and ETX you mean the respective control characters, this should get you on your way:
var txt = "";
const char soh = (char) 1;
const char eox = (char) 3;
var count = 1;
var pathToFile = @"C:\00_Projects_temp\test.txt";
using (var sr = new StreamReader(pathToFile))
txt = sr.ReadToEnd();
while (txt.Contains(soh))
{
var outfil = Path.Combine(Path.GetDirectoryName(pathToFile), count.ToString("000"), "_test.txt");
var eInd = txt.IndexOf(eox);
using (var sw = new StreamWriter(outfil, false))
{
sw.Write(txt.Substring(1, eInd - 1));
}
txt = txt.Substring(eInd + 1);
count++;
}
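The loop above assumes that each SOH sits at index 0 of the remaining text, which stops being true once a line break follows an ETX. A minimal sketch that searches for each SOH/ETX pair explicitly instead; FrameSplitter and SplitFrames are hypothetical names, and writing the frames to files is left out:

```csharp
using System.Collections.Generic;

static class FrameSplitter
{
    private const char Soh = (char)1; // start of heading
    private const char Etx = (char)3; // end of text

    // Returns the text between each SOH and the next ETX,
    // skipping anything outside a frame (such as line breaks).
    public static List<string> SplitFrames(string text)
    {
        var frames = new List<string>();
        int start = text.IndexOf(Soh);
        while (start >= 0)
        {
            int end = text.IndexOf(Etx, start + 1);
            if (end < 0) break; // unterminated frame: stop rather than guess
            frames.Add(text.Substring(start + 1, end - start - 1));
            start = text.IndexOf(Soh, end + 1);
        }
        return frames;
    }
}
```

Each returned string could then be written to its own numbered file with File.WriteAllText.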

Thank you, LocEngineer, the program works. I made a small change to concatenate the file name with the counter using "+" instead of ",".
using System.IO;
using System.Linq;
namespace ASCII_Split
{
class Program
{
static void Main(string[] args)
{
var txt = "";
const char soh = (char)1;
const char eox = (char)3;
var count = 1;
var pathToFile = @"C:\Temp\00599060.txt";
using (var sr = new StreamReader (pathToFile))
txt = sr.ReadToEnd();
if (txt.IndexOf(soh) != txt.LastIndexOf(soh))
{
while (txt.Contains(soh))
{
var outfil = Path.Combine(Path.GetDirectoryName(pathToFile), count.ToString("00") + Path.GetFileName(pathToFile));
var eInd = txt.IndexOf(eox);
using (var sw = new StreamWriter(outfil, false))
{
sw.Write(txt.Substring(1, eInd - 1));
}
txt = txt.Substring(eInd + 1);
count++;
}
File.Move((pathToFile), (pathToFile) + ".org");
}
}
}
}

Searching a file, only return strings with duplicates

I am currently working on a project for a car park. I have a CSV file which contains strings of information regarding car license plates. I have already used regex to return 90% of the license plates; however, shorter personalised number plates aren't returned correctly: for example, "AA12" returns as "AA12BC" because it fits another regex.
Each string has two instances of the car license plate. Is there a way to return only strings that match the regex and contain two instances of the number plate?
Code so far:
//start
using (TextReader reader = File.OpenText(@"C:\Users\user\documents\regdata.csv"))
{
List<string> lines = new List<string>();
string pattern = @"[A-Z]{3}[0-9]{3}";
string line;
while ((line = reader.ReadLine()) != null)
{
lines.Add(line);
}
List<string> regExs = new List<string>();
regExs.Add(@"[A-Z]{3}[0-9]{3}");
regExs.Add(@"[A-Z]{2}[0-9]{2}[A-Z]{3}");
regExs.Add(@"[A-Z]{1}[0-9]{3}[A-Z]{3}");
regExs.Add(@"[A-Z]{1}[0-9]{2}[A-Z]{3}");
regExs.Add(@"[A-Z]{1}[0-9]{1}[A-Z]{3}");
regExs.Add(@"[A-Z]{3}[0-9]{2,3}");
regExs.Add(@"[A-Z]{2}[0-9]{4}");
regExs.Add(@"[A-Z]{3}[0-9]{2}");
regExs.Add(@"[A-Z]{2}[0-9]{2}[A-Z]{3}");
using (StreamWriter writer = new StreamWriter(@"C:\Users\user\Desktop\usersNotes\plates.csv"))
{
foreach (var l in lines.Select(x => x.Split(',')[2]))
{
string result = "";
foreach (var r in regExs)
{
Regex myRegex = new Regex(r);
Match m = myRegex.Match(l);
if (m.Success)
{
result = m.Value;
break;
}
}
writer.WriteLine(l + "," + result);
}
Thanks

This should do it for you. I went ahead and coded (I think) the whole solution for you as I understand it. I made a Regex list rather than a string list; this way you don't have to build and tear down every Regex object with each loop.
Assumptions: (1) Plates never have " or , and (2) Plates don't show up more than twice.
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.IO;
using System.Text;
using System.Windows.Forms;
using System.Text.RegularExpressions;
namespace DupeOnly {
public partial class Form1 : Form{
public Form1(){
InitializeComponent();
}
private void button1_Click(object sender, EventArgs e){
string zRegData = File.ReadAllText(@"C:\Users\user\documents\regdata.csv");
HashSet<string> hsRegData = new HashSet<string>();
bool tfFirst = true;
string[] zAllPlateData = zRegData.Split(','); //License plates don't have commas
List<Regex> rxList = new List<Regex>();
rxList.Add(new Regex(@"[A-Z]{3}[0-9]{3}"));
rxList.Add(new Regex(@"[A-Z]{2}[0-9]{2}[A-Z]{3}"));
rxList.Add(new Regex(@"[A-Z]{1}[0-9]{3}[A-Z]{3}"));
rxList.Add(new Regex(@"[A-Z]{1}[0-9]{2}[A-Z]{3}"));
rxList.Add(new Regex(@"[A-Z]{1}[0-9]{1}[A-Z]{3}"));
rxList.Add(new Regex(@"[A-Z]{3}[0-9]{2,3}"));
rxList.Add(new Regex(@"[A-Z]{2}[0-9]{4}"));
rxList.Add(new Regex(@"[A-Z]{3}[0-9]{2}"));
rxList.Add(new Regex(@"[A-Z]{2}[0-9]{2}[A-Z]{3}"));
Match m;
using (StreamWriter sw = new StreamWriter(@"C:\Users\user\Desktop\usersNotes\plates.csv")){
for(int Q = 0; Q < zAllPlateData.Length; Q++){
if(hsRegData.Add(zAllPlateData[Q]) == false){
//At this point we know it is a duplicate, must still match a check pattern
foreach(Regex rx in rxList){
m = rx.Match(zAllPlateData[Q]);
if(m.Success){
if(tfFirst){
tfFirst = false;
sw.Write(zAllPlateData[Q]); //First plate doesn't take a comma
}
else{
sw.Write("," + zAllPlateData[Q]); //Comma delimit subsequent plates
}
break;
}
}
}
}
}
}
}
}
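The patterns above are unanchored, so a short pattern can still match inside a longer plate. A minimal sketch of one way to avoid that, assuming plates are delimited by non-word characters: wrap the alternatives in \b word boundaries and keep only a plate that occurs at least twice in the same line. PlateFinder, FindRepeatedPlate, and the two-pattern subset are illustrative assumptions, not the asker's full format list:

```csharp
using System.Linq;
using System.Text.RegularExpressions;

static class PlateFinder
{
    // Hypothetical subset of plate formats. The \b anchors stop the short
    // pattern from matching inside a longer plate such as AB12CDE.
    private static readonly Regex PlatePattern = new Regex(
        @"\b(?:[A-Z]{2}[0-9]{2}[A-Z]{3}|[A-Z]{2}[0-9]{2})\b");

    // Returns the first plate that appears at least twice in the line, or null if none does.
    public static string FindRepeatedPlate(string line)
    {
        return PlatePattern.Matches(line)
            .Cast<Match>()
            .Select(m => m.Value)
            .GroupBy(v => v)
            .Where(g => g.Count() >= 2)
            .Select(g => g.Key)
            .FirstOrDefault();
    }
}
```

Because the match must be bounded on both sides, a line containing AB12CDE twice yields AB12CDE rather than the shorter AB12.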

Counting number of words in a text file

I'm trying to count the number of words in a text file, starting with this one:
This is a test of the word count program. This is only a test. If your
program works successfully, you should calculate that there are 30
words in this file.
I am using StreamReader to put everything from the file into a string, and then use the .Split method to get the number of individual words, but I keep getting the wrong value when I compile and run the program.
using System;
using System.IO;
class WordCounter
{
static void Main()
{
string inFileName = null;
Console.WriteLine("Enter the name of the file to process:");
inFileName = Console.ReadLine();
StreamReader sr = new StreamReader(inFileName);
int counter = 0;
string delim = " ,.";
string[] fields = null;
string line = null;
while(!sr.EndOfStream)
{
line = sr.ReadLine();
}
fields = line.Split(delim.ToCharArray());
for(int i = 0; i < fields.Length; i++)
{
counter++;
}
sr.Close();
Console.WriteLine("The word count is {0}", counter);
}
}

Try using a regular expression, e.g.:
int count = Regex.Matches(input, @"\b\w+\b").Count;
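Wrapped in a helper, the regex approach might look like the sketch below. WordCount and CountWords are made-up names, and note that \w+ treats an apostrophe as a separator, so a word like "don't" would count as two:

```csharp
using System.Text.RegularExpressions;

static class WordCount
{
    // Counts runs of word characters (letters, digits, underscore) in the input.
    public static int CountWords(string input)
    {
        return Regex.Matches(input, @"\b\w+\b").Count;
    }
}
```

The whole file can be passed in at once, for example WordCount.CountWords(File.ReadAllText(inFileName)), which avoids the line-by-line bookkeeping entirely.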

This should work for you:
using System;
using System.IO;
class WordCounter
{
static void Main()
{
string inFileName = null;
Console.WriteLine("Enter the name of the file to process:");
inFileName = Console.ReadLine();
StreamReader sr = new StreamReader(inFileName);
int counter = 0;
string delim = " ,."; //maybe some more delimiters like ?! and so on
string[] fields = null;
string line = null;
while(!sr.EndOfStream)
{
line = sr.ReadLine();//each time you read a line you should split it into the words
line = line.Trim(); //Trim() returns a new string, so assign the result
fields = line.Split(delim.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
counter+=fields.Length; //and just add how many of them there are
}
sr.Close();
Console.WriteLine("The word count is {0}", counter);
}
}

A couple hints.
What if you just have the sentence "hi"? What would your output be?
Your counter calculation is: from 0 through fields.Length, increment counter. How are fields.Length and your counter related?

You're probably getting an off-by-one error. Try something like this:
counter = 0;
while(!sr.EndOfStream)
{
line = sr.ReadLine();
fields = line.Split(delim.ToCharArray());
counter += fields.Length;
}
There is no need to iterate over the array to count the elements when you can get the number directly.

using System.IO;
using System;
namespace solution
{
class Program
{
static void Main(string[] args)
{
var readFile = File.ReadAllText(@"C:\test\my.txt");
var str = readFile.Split(new char[] { ' ', '\n'}, StringSplitOptions.RemoveEmptyEntries);
System.Console.WriteLine("Number of words: " + str.Length);
}
}
}

//Easy method using Linq to Count number of words in a text file
/// www.techhowdy.com
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace FP_WK13
{
static class Util
{
public static IEnumerable<string> GetLines(string yourtextfile)
{
TextReader reader = new StreamReader(yourtextfile);
string result = string.Empty;
string line;
while ((line = reader.ReadLine()) != null)
{
yield return line;
}
reader.Close();
}
// Word Count
public static int GetWordCount(string str)
{
int words = 0;
string s = string.Empty;
var lines = GetLines(str);
foreach (var item in lines)
{
s = item.ToString();
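// Skip empty entries so repeated spaces and blank lines are not counted as words
words = words + s.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length;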
}
return words;
}
}
}

Convert StringWriter to string[]

I had a requirement to create a TextWriter to populate some data. I used the StringWriter as a TextWriter and the business logic works fine.
I have a technical requirement that cannot be changed because it may break all the clients: I need to get an array of strings (string[]) from the TextWriter, separated by line.
I tried this, but it didn't work:
using System;
using System.IO;
using System.Text;
namespace TextWriterToArrayOfStringsDemo
{
internal static class Program
{
private static void Main()
{
var stringBuilder = new StringBuilder();
using (TextWriter writer = new StringWriter(stringBuilder))
{
writer.Write("A");
writer.WriteLine();
writer.Write("B");
}
string fullString = stringBuilder.ToString();
string replaced = fullString.Replace('\n', '\r');
string[] arrayOfString = replaced.Split('\r');
// Returns false but should return true
Console.WriteLine(arrayOfString.Length == 2);
}
}
}
Any idea or suggestion?

Try using Environment.NewLine to split rather than '\n':
var stringBuilder = new StringBuilder();
using (TextWriter writer = new StringWriter(stringBuilder))
{
writer.Write("A");
writer.WriteLine();
writer.Write("B");
}
string fullString = stringBuilder.ToString();
string[] newline = new string[] { Environment.NewLine };
string[] arrayOfString = fullString.Split(newline, StringSplitOptions.RemoveEmptyEntries);
// Now returns true
Console.WriteLine(arrayOfString.Length == 2);

A quick fix would be to use a Regex split:
string[] arrayOfString =
System.Text.RegularExpressions.Regex.Split(stringBuilder.ToString(), "\r?\n");
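The original attempt fails because WriteLine() emits "\r\n" on Windows, so replacing '\n' with '\r' and splitting on '\r' leaves an empty entry between the two lines. A sketch of a regex-free alternative that accepts any of the three common line endings; LineSplitter and SplitLines are hypothetical names:

```csharp
using System;

static class LineSplitter
{
    // Splits on Windows (\r\n), Unix (\n) and old Mac (\r) line endings.
    // "\r\n" is listed first so that a CR+LF pair counts as a single separator.
    public static string[] SplitLines(string text)
    {
        return text.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None);
    }
}
```

StringSplitOptions.None keeps intentionally blank lines; switch to RemoveEmptyEntries if they should be dropped.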

The old switcheroo (switch position in file)

I would really appreciate it if somebody could help me or offer advice on this.
I have a file, probably about 50,000 lines long; these files are generated on a weekly basis. Each line is identical in terms of type of content.
Original file:
address^name^notes
But I need to perform a switch. I need to be able to switch (on each and every line) the address with the name, so that after the switch the names come first, then the addresses, and then the notes, like so:
Result file:
name^address^notes

50,000 isn't that much these days, so simply reading in the whole file and outputting the wanted format should work fine for you:
string[] lines = File.ReadAllLines(fileName);
string newLine = string.Empty;
foreach (string line in lines)
{
string[] items = line.Split('^');
// Swap the first two fields and keep the '^' delimiter
newLine = string.Format("{0}^{1}^{2}", items[1], items[0], items[2]);
// Append to new file here...
}

How about this?
StreamWriter sw = new StreamWriter("c:\\output.txt");
StreamReader sr = new StreamReader("c:\\input.txt");
string inputLine = "";
while ((inputLine = sr.ReadLine()) != null)
{
String[] values = null;
values = inputLine.Split('^');
sw.WriteLine("{0}^{1}^{2}", values[1], values[0], values[2]);
}
sr.Close();
sw.Close();
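Both loops above index items[2] directly, which throws on a short line and splits the notes if they contain a '^'. A small sketch of the per-line swap that guards against both, assuming malformed lines should simply pass through unchanged; FieldSwapper and SwapFirstTwo are made-up names:

```csharp
static class FieldSwapper
{
    // Swaps the first two '^'-delimited fields of a line and keeps the rest as-is.
    public static string SwapFirstTwo(string line)
    {
        // Split into at most three parts so any '^' inside the notes is preserved.
        string[] parts = line.Split(new[] { '^' }, 3);
        if (parts.Length < 3) return line; // malformed line: leave untouched
        return parts[1] + "^" + parts[0] + "^" + parts[2];
    }
}
```

Inside either loop, the write would then become something like sw.WriteLine(FieldSwapper.SwapFirstTwo(inputLine)).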

Go go gadget REGEX!
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace ConsoleApplication1
{
class Program
{
static string Switcheroo(string input)
{
return System.Text.RegularExpressions.Regex.Replace
(input,
@"^([^^]+)\^([^^]+)\^(.+)$",
"$2^$1^$3",
System.Text.RegularExpressions.RegexOptions.Multiline);
}
static void Main(string[] args)
{
string input = "address 1^name 1^notes1\n" +
"another address^another name^more notes\n" +
"last address^last name^last set of notes";
string output = Switcheroo(input);
Console.WriteLine(output);
Console.ReadKey(true);
}
}
}
