I found a way to clean an XML file of invalid characters, which works fine, but it is a bit slow: the cleaning takes ~10-20 seconds, which users do not appreciate.
It seems like a huge waste of time to use a StreamReader/StreamWriter pass to create a clean file and then parse it with XmlReader. Is it possible to clean each line during the XML read, or at least use a StreamReader as the input to XmlReader, to save the time spent writing the intermediate file?
I'm trying to get the team that creates the databases to produce clean files before uploading them, but that is a slow process...
XmlReaderSettings settings = new XmlReaderSettings { CheckCharacters = false };
string cleanDatabase = createCleanSDDB(database);
using (XmlReader sddbReader = XmlReader.Create(cleanDatabase, settings))
{
    // Parse XML...
}
private string createCleanSDDB(string sddbPath)
{
    string fileName = getTmpFileName(); // get a temporary file name from the OS
    string line;
    string cleanLine;

    using (StreamReader streamReader = new StreamReader(sddbPath, Encoding.UTF8))
    using (StreamWriter streamWriter = new StreamWriter(fileName))
    {
        while ((line = streamReader.ReadLine()) != null)
        {
            cleanLine = getCleanLine(line);
            streamWriter.WriteLine(cleanLine);
        }
    }
    return fileName;
}
private string getCleanLine(string dirtyLine)
{
    // matches characters that are invalid in XML 1.0; note that this works on
    // UTF-16 code units, so surrogate pairs (characters above U+FFFF) are
    // stripped as well
    const string regexPattern = @"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD]";
    string cleanLine = Regex.Replace(dirtyLine, regexPattern, "");
    return cleanLine;
}
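On the "clean during the read" question: XmlReader.Create also accepts a TextReader, so in principle the temporary file can be skipped by wrapping the StreamReader in a filtering reader. Below is a minimal sketch of that idea; the CleanTextReader class is hypothetical, and for simplicity it replaces invalid characters with spaces instead of removing them.

// Hypothetical filtering reader: cleans characters as XmlReader pulls data,
// so no intermediate file is ever written.
class CleanTextReader : TextReader
{
    private readonly TextReader _inner;

    public CleanTextReader(TextReader inner)
    {
        _inner = inner;
    }

    public override int Peek()
    {
        int c = _inner.Peek();
        return (c == -1 || IsValidXmlChar((char)c)) ? c : ' ';
    }

    public override int Read()
    {
        int c = _inner.Read();
        return (c == -1 || IsValidXmlChar((char)c)) ? c : ' ';
    }

    public override int Read(char[] buffer, int index, int count)
    {
        int read = _inner.Read(buffer, index, count);
        for (int i = index; i < index + read; i++)
        {
            if (!IsValidXmlChar(buffer[i]))
                buffer[i] = ' '; // replace in place so the returned count stays correct
        }
        return read;
    }

    private static bool IsValidXmlChar(char c)
    {
        return c == 0x9 || c == 0xA || c == 0xD ||
               (c >= 0x20 && c <= 0xD7FF) ||
               (c >= 0xE000 && c <= 0xFFFD);
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing)
            _inner.Dispose();
        base.Dispose(disposing);
    }
}

Usage would then feed the filtered reader straight into XmlReader:

using (var streamReader = new StreamReader(database, Encoding.UTF8))
using (XmlReader sddbReader = XmlReader.Create(new CleanTextReader(streamReader), settings))
{
    // Parse XML...
}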
Related
I am trying to learn how to save a JSON string in a .dat file, but I have trouble converting it back to the correct JSON string. My new string at the end starts with 2 special characters (the rest of it is correct) and I am not sure why.
//Saving
string save = "a json string";
string path = @"E:\tempTest\MyTest.dat";
if (!File.Exists(path))
{
    FileStream myFile = File.Create(path);
    BinaryWriter binaryfile = new BinaryWriter(myFile);
    binaryfile.Write(save);
    binaryfile.Close();
    myFile.Close();
}
//Loading
string path = @"E:\tempTest\MyTest.dat";
StreamReader objInput = new StreamReader(path, System.Text.Encoding.Default);
string contents = objInput.ReadToEnd().Trim();
string[] split = System.Text.RegularExpressions.Regex.Split(contents, "\\s+", RegexOptions.None);
StringBuilder sb = new StringBuilder();
foreach (string s in split)
{
    sb.AppendLine(s);
}
string save = sb.ToString(); //string starts with 2 wrong special characters
I can obviously fix it with a simple save = save.Substring(2), but I would like to understand what the error was in my code (I guessed the "\\s+" part of the regex was wrong).
Also, I am not sure whether this is still a good way of converting JSON to a data file and back; the example I followed is from a 10-year-old post I found online.
As posted in the comments, I should have used BinaryReader to read the file. BinaryWriter.Write(string) prefixes the string with its length as a 7-bit-encoded integer, which is what the two leading "special characters" were; BinaryReader.ReadString knows to consume that prefix. This solved the problem.
//Loading
string path = @"E:\tempTest\MyTest.dat";
var stream = File.Open(path, FileMode.Open);
var reader = new BinaryReader(stream, Encoding.UTF8, false);
string save = reader.ReadString();
reader.Close(); // also closes the underlying stream, since leaveOpen is false
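As a side note on the "is this still a good way" question: if the .dat file does not have to be a binary format, writing the JSON as plain text avoids the length prefix entirely. A minimal sketch, assuming UTF-8 text output is acceptable:

//Saving
File.WriteAllText(path, save);
//Loading
string loaded = File.ReadAllText(path);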
I have two blocks of code that I've tried using for reading data out of a file stream in C#. My overall goal is to read each line of text into a list of strings, but all of the text is being read into a single string (when the file is opened with read+write access together)...
I am noticing that the first block of code correctly reads in all of my carriage returns and line feeds, while the other ignores them. I am not sure what is really going on here. I open the streams in two different ways, but that shouldn't really matter, right? In any case, here is the first block of code (the one that correctly reads in my whitespace characters):
StreamReader sr = null;
StreamWriter sw = null;
FileStream fs = null;
List<string> content = new List<string>();
List<string> actual = new List<string>();
string line = string.Empty;

// first, open up the file for reading
fs = File.OpenRead(path);
sr = new StreamReader(fs);

// read-in the entire file line-by-line
while (!string.IsNullOrEmpty((line = sr.ReadLine())))
{
    content.Add(line);
}
sr.Close();
Now, here is the block of code that ignores all of the white-space characters (i.e. line-feed, carriage-return) and reads my entire file in one line.
StreamReader sr = null;
StreamWriter sw = null;
FileStream fs = null;
List<string> content = new List<string>();
List<string> actual = new List<string>();
string line = string.Empty;

// first, open up the file for reading/writing
fs = File.Open(path, FileMode.Open);
sr = new StreamReader(fs);

// read-in the entire file line-by-line
while (!string.IsNullOrEmpty((line = sr.ReadLine())))
{
    content.Add(line);
}
sr.Close();
Why does Open cause all data to be read as a single line, and OpenRead works correctly (reads data as multiple lines)?
UPDATE 1
I have been asked to provide the text of the file that reproduces the problem. So here it is below (make sure that CR+LF is at the end of each line!! I am not sure if that will get pasted here!)
;$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
;$$$$$$$$$ $$$$$$$
;$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
;
;
;
UPDATE 2
An exact block of code that reproduces the problem (using the text above for the file). In this case I am actually seeing the problem WITHOUT trying Open and only using OpenRead.
StreamReader sr = null;
StreamWriter sw = null;
FileStream fs = null;
List<string> content = new List<string>();
List<string> actual = new List<string>();
string line = string.Empty;
try
{
    // first, open up the file for reading/writing
    fs = File.OpenRead(path);
    sr = new StreamReader(fs);

    // read-in the entire file line-by-line
    while (!string.IsNullOrEmpty((line = sr.ReadLine())))
    {
        content.Add(line);
    }
    sr.Close();

    // now, erase the contents of the file
    File.WriteAllText(path, string.Empty);

    // make sure that the contents of the file have been erased
    fs = File.OpenRead(path);
    sr = new StreamReader(fs);
    if (!string.IsNullOrEmpty(line = sr.ReadLine()))
    {
        Trace.WriteLine("Failed: Could not erase the contents of the file.");
        Assert.Fail();
    }
    else
    {
        Trace.WriteLine("Passed: Successfully erased the contents of the file.");
    }

    // now, attempt to over-write the contents of the file
    fs.Close();
    fs = File.OpenWrite(path);
    sw = new StreamWriter(fs);
    foreach (var l in content)
    {
        sw.Write(l);
    }

    // read back the over-written contents of the file
    fs.Close();
    fs = File.OpenRead(path);
    sr = new StreamReader(fs);
    while (!string.IsNullOrEmpty((line = sr.ReadLine())))
    {
        actual.Add(line);
    }

    // make sure the contents of the file are correct
    if (content.SequenceEqual(actual))
    {
        Trace.WriteLine("Passed: The contents that were over-written are correct!");
    }
    else
    {
        Trace.WriteLine("Failed: The contents that were over-written are not correct!");
    }
}
finally
{
    // close out all the streams
    fs.Close();
    // finish-up with a message
    Trace.WriteLine("Finished running the overwrite-file test.");
}
Your new file, generated by
foreach (var l in content)
{
    sw.Write(l);
}
does not contain end-of-line characters, because end-of-line characters are not included in content.
As @DaveKidder points out in this thread over here, the documentation for StreamReader.ReadLine specifically says that the returned line does not include the terminating carriage return or line feed.
When you do
while (!string.IsNullOrEmpty((line = sr.ReadLine())))
{
    content.Add(line);
}
sr.Close();
you are losing the end-of-line characters.
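A minimal fix, sketched here rather than taken from the original answer: write each line back with WriteLine, which re-appends the terminator that ReadLine stripped, and flush the writer before closing the underlying FileStream:

foreach (var l in content)
{
    sw.WriteLine(l); // re-appends the line terminator
}
sw.Flush(); // push buffered data to fs before fs.Close()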
When I add a txt file as a resource to a project, how can I then consume the contents of that resource as a string?
The closest I've been able to get is using the ResourceManager to pull an unmanaged stream. However, this throws a null reference error:
using (StreamReader sr = new StreamReader(
    Properties.Resources.ResourceManager.GetStream(
        "TestFile.txt", CultureInfo.CurrentCulture)))
{
    Console.WriteLine(sr.ReadToEnd());
}
You could do this too:
var assembly = Assembly.GetExecutingAssembly();
var myTxtFileResource = "Namespace.Project.MyTxtFile.txt";
using (Stream stream = assembly.GetManifestResourceStream(myTxtFileResource))
using (StreamReader reader = new StreamReader(stream))
{
    string result = reader.ReadToEnd();
}
You don't need to do it like that for text files.
Just write it like this:
https://msdn.microsoft.com/en-us/library/aa287548(v=vs.71).aspx
System.IO.StreamWriter file = new System.IO.StreamWriter("c:\\test.txt");
file.WriteLine(lines); // lines holds the text to write
file.Close();
and read it like this:
https://msdn.microsoft.com/en-us/library/aa287535(v=vs.71).aspx
int counter = 0;
string line;

// Read the file and display it line by line.
System.IO.StreamReader file =
    new System.IO.StreamReader("c:\\test.txt");
while ((line = file.ReadLine()) != null)
{
    Console.WriteLine(line);
    counter++;
}
file.Close();
I have a text file that ModScan writes data into. At a particular time I have to read that data and save it in a database. In offline mode, i.e. without ModScan using the file, I can read the data and save it in the database just fine; however, while ModScan has the file open, it throws an exception:
Cannot access file as it is being used by another process.
My code:
using System.IO;
string path = dt.Rows[i][11].ToString();
string[] lines = System.IO.File.ReadAllLines(path);
path holds "E:\Metertxt\02.txt".
So what changes do I need to make in order to read the file without interfering with ModScan?
I googled and found this, which might work, but I am not sure how to use it:
FileShare.ReadWrite
You can use a FileStream to open a file that is already open in another application. Then you'll need a StreamReader if you want to read it line by line. This works, assuming a file encoding of UTF8:
using (var stream = new FileStream(@"c:\tmp\locked.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
    using (var reader = new StreamReader(stream, Encoding.UTF8))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Do something with line, e.g. add to a list or whatever.
            Console.WriteLine(line);
        }
    }
}
Alternative in case you really need a string[]:
var lines = new List<string>();
using (var stream = new FileStream(@"c:\tmp\locked.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
    using (var reader = new StreamReader(stream, Encoding.UTF8))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            lines.Add(line);
        }
    }
}
// Now you have a List<string>, which can be converted to a string[] if you really need one.
var stringArray = lines.ToArray();
FileStream fstream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
StreamReader sreader = new StreamReader(fstream);
List<string> lines = new List<string>();
string line;
while ((line = sreader.ReadLine()) != null)
    lines.Add(line);
// do something with the lines

// alternatively, if you need the whole file at once:
// string allLines = sreader.ReadToEnd();
What I have:
A large XML file, nearly 1 million lines' worth of content. Example of content:
<etc35yh3 etc="numbers" etc234="a" etc345="date"><something><some more something></some more something></something></etc123>
<etc123 etc="numbers" etc234="a" etc345="date"><something><some more something></some more something></something></etc123>
<etc15y etc="numbers" etc234="a" etc345="date"><something><some more something></some more something></something></etc123>
^ repeat that for 900k or so lines (content changing, of course)
What I need:
Search the XML file for "<etc123". Once found move (write) that line along with all lines below it to a separate XML file.
Would it be advisable to use a method such as File.ReadAllLines for the search portion? What would you all recommend for the writing portion? Line by line is not an option as far as I can tell, as it would take much too long.
To quite literally discard the content above your search string, I would not use File.ReadAllLines, as it would load the entire file into memory. Try File.Open and wrap it in a StreamReader. Loop on StreamReader.ReadLine, then start writing to a new StreamWriter, or do a byte copy on the underlying file stream.
An example of how to do so with StreamWriter/StreamReader alone is listed below.
//load the input file
//open with read and sharing
using (FileStream fsInput = new FileStream("input.txt",
    FileMode.Open, FileAccess.Read, FileShare.Read))
{
    //use streamreader to search for start
    var srInput = new StreamReader(fsInput);
    string searchString = "two";
    string cSearch = null;
    bool found = false;
    while ((cSearch = srInput.ReadLine()) != null)
    {
        if (cSearch.StartsWith(searchString, StringComparison.CurrentCultureIgnoreCase))
        {
            found = true;
            break;
        }
    }
    if (!found)
        throw new Exception("Searched string not found.");

    //we have the data, write to a new file
    using (StreamWriter sw = new StreamWriter(
        new FileStream("out.txt", FileMode.Create, //create or overwrite
            FileAccess.Write, FileShare.None))) // write only, no sharing
    {
        //write the line that we found in the search
        sw.WriteLine(cSearch);
        string cline = null;
        while ((cline = srInput.ReadLine()) != null)
            sw.WriteLine(cline);
    }
}
//both files are closed and complete
You can copy with LINQ to XML:
XElement doc = XElement.Load("yourXML.xml");
// an XDocument may contain only one root element, so gather the
// matching elements under a single wrapper element
XElement root = new XElement("root");
foreach (XElement elm in doc.DescendantsAndSelf("etc123"))
{
    root.Add(elm);
}
new XDocument(root).Save("yourOutputXML.xml");
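Note that XElement.Load reads the whole ~1M-line document into memory. A streaming sketch of the same idea (an assumption on my part, not part of the original answer; it requires the input to be well-formed XML with a single root element):

using (XmlReader reader = XmlReader.Create("yourXML.xml"))
using (XmlWriter writer = XmlWriter.Create("yourOutputXML.xml"))
{
    writer.WriteStartElement("root"); // synthetic root for the output
    while (reader.ReadToFollowing("etc123"))
    {
        // copy each matching element without materializing the document
        writer.WriteNode(reader.ReadSubtree(), false);
    }
    writer.WriteEndElement();
}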
You could do it one line at a time. I would not use ReadToEnd if you are checking the contents of each line.
FileInfo file = new FileInfo("MyHugeXML.xml");
FileInfo outFile = new FileInfo("ResultFile.xml");
using (StreamWriter write = outFile.CreateText())
using (StreamReader sr = new StreamReader(file.OpenRead()))
{
    bool foundit = false;
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        if (!foundit && line.Contains("<etc123"))
        {
            foundit = true; // include the matched line and everything after it
        }
        if (foundit)
        {
            write.WriteLine(line);
        }
    }
}
Please note, this method may not produce valid XML, given your requirements.