I want to split one large Excel file into a few smaller, accessible files.
I already tried the following code, but the resulting files are not readable:
using (System.IO.StreamReader sr = new System.IO.StreamReader("path"))
{
    int fileNumber = 0;
    while (!sr.EndOfStream)
    {
        int count = 0;
        using (System.IO.StreamWriter sw = new System.IO.StreamWriter("other path" + ++fileNumber + ".xlsx"))
        {
            sw.AutoFlush = true;
            while (!sr.EndOfStream && ++count < 20000)
            {
                sw.WriteLine(sr.ReadLine());
            }
        }
    }
}
Any ideas?
thanks.
Files other than text files don't work this way. You can't simply cut a binary file at an arbitrary point and obtain a working copy.
As for Excel files, you may look into the following tutorial, which illustrates how to automate Excel from C#:
https://support.microsoft.com/en-us/help/302084/how-to-automate-microsoft-excel-from-microsoft-visual-c--net
Basically, what you want to do is open your large Excel file, decide where to split it (every n rows, every n sheets, and so on), read each portion, and write it into a newly created .xlsx file.
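For instance, splitting every n rows could be sketched with a library that understands the workbook object model. Here is a minimal, untested sketch assuming the EPPlus library; the `SplitWorkbook` name and the paths are illustrative, and note that worksheet indexing is 0-based in EPPlus 5+ but 1-based in EPPlus 4 and earlier:

```csharp
using System;
using System.IO;
using OfficeOpenXml; // EPPlus NuGet package (assumed available)

static void SplitWorkbook(string inputPath, string outputPrefix, int rowsPerFile)
{
    using (var source = new ExcelPackage(new FileInfo(inputPath)))
    {
        // 0-based index assumes EPPlus 5+; use Worksheets[1] on EPPlus 4 and earlier.
        ExcelWorksheet sheet = source.Workbook.Worksheets[0];
        int totalRows = sheet.Dimension.End.Row;
        int totalCols = sheet.Dimension.End.Column;
        int fileNumber = 0;

        for (int startRow = 1; startRow <= totalRows; startRow += rowsPerFile)
        {
            int endRow = Math.Min(startRow + rowsPerFile - 1, totalRows);
            using (var part = new ExcelPackage())
            {
                ExcelWorksheet target = part.Workbook.Worksheets.Add("Sheet1");
                // Copy this slice of rows into the new workbook.
                sheet.Cells[startRow, 1, endRow, totalCols].Copy(target.Cells[1, 1]);
                part.SaveAs(new FileInfo(outputPrefix + ++fileNumber + ".xlsx"));
            }
        }
    }
}
```

Called as, say, `SplitWorkbook(@"C:\data\big.xlsx", @"C:\data\part", 20000)`, this would produce part1.xlsx, part2.xlsx, and so on, each a valid workbook in its own right.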
I have some code that is supposed to enter some values into several Excel workbooks. Right now the program doesn't even put any values into the workbooks and only saves them. Even so, I get this error when opening the files: Excel cannot open the file **.xlsm because the file format or file extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file.
I have written many programs that work with Excel files and never had this problem. In the code you can see that I basically just go through a for loop and save the file.
try
{
    fileInfo = new FileInfo(Path.GetDirectoryName(Application.StartupPath) + '\\' + partners[partner].partnerName + @"\PDP_ExSumm_" + partners[partner].partnerName + ".xlsm");
    using (ExcelPackage excelPackage = new ExcelPackage(fileInfo))
    {
        ExcelWorksheet worksheet = excelPackage.Workbook.Worksheets[1];
        for (int cell = 0; cell < ExSummCells.Count; cell++)
        {
            if (ExSummCells[cell] != "")
            {
                // worksheet.Cells[ExSummCells[cell]].Value = partners[partner].exSummData[partner];
            }
            excelPackage.Save();
        }
    }
}
catch (Exception ex)
{
    // error handling omitted in the original snippet
}
A little background on the problem:
We have an ASP.NET MVC5 application where we use FlexMonster to show the data in a grid. The data source is a stored procedure that brings all the data into the UI grid, and once the user clicks the export button, it exports the report to Excel. However, in some cases the export to Excel fails.
Some of the data has some invalid characters, and it is not possible/feasible to fix the source as suggested here
My approach so far:
The EPPlus library fails on initializing the workbook because the input Excel file contains some invalid XML characters. I found that the file is dumped with an invalid character in it. I looked into the possible approaches.
Firstly, I identified the problematic character in the excel file. I first tried to replace the invalid character with blank space manually using Notepad++ and the EPPlus could successfully read the file.
Now, using the approaches given in the other SO threads here and here, I replaced all possible occurrences of invalid characters. At the moment I am using the
XmlConvert.IsXmlChar
method to find the problematic XML characters and replace them with a blank space.
I created a sample program where I am trying to work on the problematic excel sheet.
//in main method
String readFile = File.ReadAllText(filePath);
string content = RemoveInvalidXmlChars(readFile);
File.WriteAllText(filePath, content);
//removal of invalid characters
static string RemoveInvalidXmlChars(string inputText)
{
    StringBuilder withoutInvalidXmlCharsBuilder = new StringBuilder();
    int firstOccurenceOfRealData = inputText.IndexOf("<t>");
    int lastOccurenceOfRealData = inputText.LastIndexOf("</t>");
    if (firstOccurenceOfRealData < 0 ||
        lastOccurenceOfRealData < 0 ||
        firstOccurenceOfRealData > lastOccurenceOfRealData)
        return inputText;
    withoutInvalidXmlCharsBuilder.Append(inputText.Substring(0, firstOccurenceOfRealData));
    int remaining = lastOccurenceOfRealData - firstOccurenceOfRealData;
    string textToCheckFor = inputText.Substring(firstOccurenceOfRealData, remaining);
    foreach (char c in textToCheckFor)
    {
        withoutInvalidXmlCharsBuilder.Append((XmlConvert.IsXmlChar(c)) ? c : ' ');
    }
    withoutInvalidXmlCharsBuilder.Append(inputText.Substring(lastOccurenceOfRealData));
    return withoutInvalidXmlCharsBuilder.ToString();
}
If I replace the problematic character manually using Notepad++, the file opens fine in MS Excel. The above code successfully replaces the same invalid character and writes the content back to the file. However, when I try to open the Excel file using MS Excel, it throws an error saying the file may have been corrupted, and no content is displayed (snapshots below). Moreover, running the following code
var excelPackage = new ExcelPackage(new FileInfo(filePath));
on the file that I updated via Notepad++ throws the following exception:
"CRC error: the file being extracted appears to be corrupted. Expected 0x7478AABE, Actual 0xE9191E00"
My Questions:
Is my approach to modify content this way correct?
If yes, How can I write updated string to an Excel file?
If my approach is wrong then, How can I proceed to get rid of invalid XML chars?
Errors shown on opening file (without invalid XML char):
First Pop up
When I click on yes
Thanks in advance!
It does sound like a binary (presumably XLSX) file based on your last comment. To confirm, open the file created by FlexMonster with 7-Zip. If it opens properly and you see a bunch of XML files in folders, it's an XLSX.
In that case, a search/replace on a binary file sounds like a very bad idea. It might work on the XML parts but might also replace legitimate characters in other parts. I think the better approach would be to do as @PanagiotisKanavos suggests and use ZipArchive. But you have to rebuild it in the right order, otherwise Excel complains. Similar to how it was done here https://stackoverflow.com/a/33312038/1324284, you could do something like this:
public static void ReplaceXmlString(this ZipArchive xlsxZip, FileInfo outFile, string oldString, string newstring)
{
    using (var outStream = outFile.Open(FileMode.Create, FileAccess.ReadWrite))
    using (var copiedzip = new ZipArchive(outStream, ZipArchiveMode.Update))
    {
        //Go through each file in the zip one by one and copy over to the new file - entries need to be in order
        foreach (var entry in xlsxZip.Entries)
        {
            var newentry = copiedzip.CreateEntry(entry.FullName);
            var newstream = newentry.Open();
            var orgstream = entry.Open();
            //Copy non-xml files over
            if (!entry.Name.EndsWith(".xml"))
            {
                orgstream.CopyTo(newstream);
            }
            else
            {
                //Load the xml document to manipulate
                var xdoc = new XmlDocument();
                xdoc.Load(orgstream);
                var xml = xdoc.OuterXml.Replace(oldString, newstring);
                xdoc = new XmlDocument();
                xdoc.LoadXml(xml);
                xdoc.Save(newstream);
            }
            orgstream.Close();
            newstream.Flush();
            newstream.Close();
        }
    }
}
When it is used like this:
[TestMethod]
public void ReplaceXmlTest()
{
    var datatable = new DataTable("tblData");
    datatable.Columns.AddRange(new[]
    {
        new DataColumn("Col1", typeof (int)),
        new DataColumn("Col2", typeof (int)),
        new DataColumn("Col3", typeof (string))
    });
    for (var i = 0; i < 10; i++)
    {
        var row = datatable.NewRow();
        row[0] = i;
        row[1] = i * 10;
        row[2] = i % 2 == 0 ? "ABCD" : "AXCD";
        datatable.Rows.Add(row);
    }
    using (var pck = new ExcelPackage())
    {
        var workbook = pck.Workbook;
        var worksheet = workbook.Worksheets.Add("source");
        worksheet.Cells.LoadFromDataTable(datatable, true);
        worksheet.Tables.Add(worksheet.Cells["A1:C11"], "Table1");
        //Now simulate the copy/open of the excel file into a zip archive
        using (var orginalzip = new ZipArchive(new MemoryStream(pck.GetAsByteArray()), ZipArchiveMode.Read))
        {
            var fi = new FileInfo(@"c:\temp\ReplaceXmlTest.xlsx");
            if (fi.Exists)
                fi.Delete();
            orginalzip.ReplaceXmlString(fi, "AXCD", "REPLACED!!");
        }
    }
}
Gives this:
Just keep in mind that this is completely brute force. Anything you can do to make the file filter smarter rather than simply processing ALL the XML files would be a very good thing. Maybe limit it to sharedStrings.xml if that is where the problem lies, or to the XML files in the worksheet folders. Hard to say without knowing more about the data.
I am trying to count the records of an Excel file with a .csv extension using the following code, but I am not getting the exact number of rows.
int lineCount = 0;
using (var reader = File.OpenText(@fileFullPath))
{
    while (reader.ReadLine() != null)
    {
        lineCount++;
    }
}
Can anyone please advise me on this.
You can do:
var lineCount = File.ReadAllLines(@fileFullPath).Length;
First, try to open the .csv file in Notepad. If the file opens and is readable, it means your file is plain text and can be read line by line without further processing.
If so, you can use var lineCount = File.ReadAllLines(@fileFullPath).Length. Otherwise, you need a library for reading Excel files.
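One reason a raw line count may not match the record count is that CSV allows quoted fields to contain embedded newlines, so one record can span several physical lines. As an untested sketch, the built-in TextFieldParser (in the Microsoft.VisualBasic assembly on .NET Framework; `CountCsvRecords` is a hypothetical helper name) counts whole records rather than lines:

```csharp
using System;
using Microsoft.VisualBasic.FileIO; // add a reference to Microsoft.VisualBasic

static int CountCsvRecords(string path)
{
    int recordCount = 0;
    using (var parser = new TextFieldParser(path))
    {
        parser.TextFieldType = FieldType.Delimited;
        parser.SetDelimiters(",");
        parser.HasFieldsEnclosedInQuotes = true; // a quoted field may span lines
        while (!parser.EndOfData)
        {
            parser.ReadFields(); // consumes one full record, not one physical line
            recordCount++;
        }
    }
    return recordCount;
}
```

If the mismatch persists even with a proper CSV parser, the file is probably not plain CSV at all (e.g. an .xlsx renamed to .csv), in which case an Excel library is needed.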
I have a button in my web page that will export CSV files. There are 5 files in total. When the client clicks the button, the server will create the files, compress them into one ZIP file, then send the ZIP file to the client for download.
I have heard around the forums about SharpZipLib and DotNetZip, but I haven't explored either yet. I have also heard of using System.IO.Compression. Which of these methods would you recommend?
I have this code to create the 5 CSV files:
StringBuilder sb = new StringBuilder();
DataTable[] dtCSV =
{
    file1BLO.SelectFile1ForCSV(),
    file2BLO.SelectFile2ForCSV(),
    file3BLO.SelectFile3ForCSV(),
    file4BLO.SelectFile4ForCSV(),
    file5BLO.SelectFile5ForCSV()
};
for (int i = 0; i <= 4; i++)
{
    DataTable dt = dtCSV[i];
    foreach (DataRow dr in dt.Rows)
    {
        string[] fields = dr.ItemArray.Select(field => field.ToString()).ToArray();
        sb.AppendLine(string.Join("|", fields));
    }
    Response.ContentType = "application/text";
    Response.AddHeader("content-disposition", "attachment;filename=CAPRES-FILE" +
        (i + 1) + "-" + DateTime.Now.ToString("yyyyMMdd-HHmmss") + ".txt");
    Response.Output.Write(sb);
    Response.Flush();
    sb.Clear();
}
Response.End();
EDIT I'm using ASP.NET v4.0.
EDIT 2 Apparently I have System.IO.Compression, which is weird because I thought it was only supported in v4.5. Coincidentally, I don't have System.IO.Packaging.
With the help of Sachu, we were able to accomplish this requirement. We used DotNetZip over SharpZipLib due to its licensing issues.
To facilitate the development of this functionality, I created a program flow based on my requirements:
Create text files
Add the text files to a folder
Compress this folder in Zip format
Send to client using Response
Delete files
Step 0 - Setup Project
Before we start the process, we must prepare the project. This includes adding the necessary folders and instantiating variables.
First we add a folder to which we will 'temporarily' add the text files. This folder will also be the one that will get compressed. I decided to create the folder in the root directory of the project with the name CSV.
Now we'll be using the DotNetZip library. You can download it here. Add the library to your project references. Then add the using, which is using Ionic.Zip;.
Then we instantiate the variables such as the zipFileName, textFileName, etc. The names speak for themselves.
The data for the text files will come from a DataTable[] array, with each DataTable corresponding to a specific SQL query.
DataTable[] dtCSV =
{
    file1BLO.SelectFile1ForCSV(),
    file2BLO.SelectFile2ForCSV(),
    file3BLO.SelectFile3ForCSV(),
    file4BLO.SelectFile4ForCSV(),
    file5BLO.SelectFile5ForCSV()
};
StringBuilder sb = new StringBuilder();
string textFileNameTemplate = Server.MapPath(@"~\CSV") + @"\file";
Response.Clear();
Response.BufferOutput = false;
Response.ContentType = "application/zip";
Response.AppendHeader("content-disposition", "attachment;filename=CAPRES-" +
    DateTime.Now.ToString("yyyyMMdd-HHmmss") + ".zip");
Step 1 - Create Text Files
This is fairly easy. I used a StringBuilder to convert the results from the DataTables, and then a StreamWriter to build the text files themselves.
for (int i = 0; i <= 4; i++)
{
    DataTable dt = dtCSV[i];
    foreach (DataRow dr in dt.Rows)
    {
        string[] fields = dr.ItemArray.Select(field => field.ToString()).ToArray();
        sb.AppendLine(string.Join("|", fields));
    }
    string textFileName = textFileNameTemplate + (i + 1) + ".txt";
    var textFile = new StreamWriter(textFileName);
    textFile.WriteLine(sb.ToString());
    textFile.Flush();
    textFile.Close();
}
Notice how I used the textFileNameTemplate variable. I append the iterator and a .txt file extension. Therefore, we will have files named file1.txt, file2.txt, file3.txt, etc.
Step 3 & 4 - Compress The Folder & Send To Client
Now we can proceed with the zipping. We modified the code in Step 2 to accommodate the library.
using (ZipFile zip = new ZipFile()) //encapsulate Step 2 code in this code block
{
    for (int i = 0; i <= 4; i++)
    {
        DataTable dt = dtCSV[i];
        foreach (DataRow dr in dt.Rows)
        {
            string[] fields = dr.ItemArray.Select(field => field.ToString()).ToArray();
            sb.AppendLine(string.Join("|", fields));
        }
        string textFileName = textFileNameTemplate + (i + 1) + ".txt";
        var textFile = new StreamWriter(textFileName);
        textFile.WriteLine(sb.ToString());
        textFile.Flush();
        textFile.Close();
        sb.Clear();
        zip.AddFile(textFileName, @"\"); //this is new
    }
    zip.Save(Response.OutputStream); //this is also new
}
Response.Flush();
Response.End();
Response.Flush();
Response.End();
zip.AddFile(textFileName, @"\"); adds the text file to the archive. The @"\" means that DotNetZip will not create the subfolders that lead to the file; e.g. if my file is at C:\User\Documents\...\file1.txt, the archive would otherwise contain a similar folder structure. With @"\", the archive will only contain the text file.
Also take note of sb.Clear(); and its position in the code. It's important that it is inside the for loop but after the textFile.WriteLine(sb.ToString()); line. This makes sure that previously written strings are cleared before the next iteration, which avoids carrying over strings from File1 to File2, File2 to File3, and so on.
zip.Save(Response.OutputStream); will directly output the Zip file to the Response and does not save the file in the server.
Step 5 - Delete Files
This step depends on your requirements. For me, we will delete the generated files. Using System.IO.File, we will delete the text files. After the using (ZipFile zip = new ZipFile()) code block, we'll add the following lines:
for (int i = 1; i <= 5; i++)
{
    File.Delete(textFileNameTemplate + i + ".txt");
}
My code probably isn't the most optimized, but it works. If anyone can suggest better code, that would be great; for now, I'll be using this. Many thanks, especially to Sachu, a really helpful person.
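Since the question also mentions System.IO.Compression (available from .NET 4.5), the same flow can be sketched without DotNetZip and without temporary files on disk, by streaming a ZipArchive straight to the response. This is an untested alternative; `WriteZipToStream` is a hypothetical helper name, and dtCSV and the response headers are assumed to be set up as above:

```csharp
using System.Data;
using System.IO;
using System.IO.Compression; // built into .NET 4.5+; no external library needed
using System.Linq;

// Writes each DataTable as a pipe-delimited text entry of a single zip,
// streamed directly to the output without creating files on disk.
static void WriteZipToStream(DataTable[] tables, Stream output)
{
    using (var zip = new ZipArchive(output, ZipArchiveMode.Create, leaveOpen: true))
    {
        for (int i = 0; i < tables.Length; i++)
        {
            ZipArchiveEntry entry = zip.CreateEntry("file" + (i + 1) + ".txt");
            using (var writer = new StreamWriter(entry.Open()))
            {
                foreach (DataRow dr in tables[i].Rows)
                {
                    string[] fields = dr.ItemArray.Select(f => f.ToString()).ToArray();
                    writer.WriteLine(string.Join("|", fields));
                }
            }
        }
    }
}
```

In the page code-behind this might be called as WriteZipToStream(dtCSV, Response.OutputStream); followed by Response.End();. Because nothing touches the file system, the delete step would become unnecessary.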
I found a How to break a PDF into parts tutorial that demonstrates how to split a PDF file into separate PDF files either by pages or by maximum file size using Adobe Acrobat:
I have found many examples on StackOverflow on how to split a PDF by page with C#. But how can I do the latter? How can I split a PDF file into multiple PDF files by a maximum file size using C#?
For example, say I have a PDF file that is 70 pages and 40 MB. Instead of splitting into 7 PDF files of 10 pages each, how can I split the file into around 5 PDF files that are no greater than 10 MB each using C#?
So far, the best method I have seen was in Using itextsharp to split a pdf into smaller pdf's based on size, where Cyfer13 used iTextSharp to split the file by page and then group those page files by size. But is there a more direct way to accomplish this without having to first split by page?
Starting from PDFsharp Sample: Split Document, I wrote the following SplitBySize method:
public static void SplitBySize(string filename, long limit)
{
    PdfDocument input = PdfReader.Open(filename, PdfDocumentOpenMode.Import);
    PdfDocument output = CreateDocument(input);
    string name = Path.GetFileNameWithoutExtension(filename);
    string temp = string.Format("{0} - {1}.pdf", name, 0);
    int j = 1;
    for (int i = 0; i < input.PageCount; i++)
    {
        PdfPage page = input.Pages[i];
        output.AddPage(page);
        output.Save(temp);
        FileInfo info = new FileInfo(temp);
        if (info.Length <= limit)
        {
            string path = string.Format("{0} - {1}.pdf", name, j);
            if (File.Exists(path))
            {
                File.Delete(path);
            }
            File.Move(temp, path);
        }
        else
        {
            if (output.PageCount > 1)
            {
                output = CreateDocument(input);
                ++j;
                --i;
            }
            else
            {
                throw new Exception(
                    string.Format("Page #{0} is greater than the document size limit of {1} MB (size = {2})",
                        i + 1,
                        limit / 1E6,
                        info.Length));
            }
        }
    }
}
I will continue to test, but it is working so far.
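The CreateDocument helper used above is not shown. Following the PDFsharp "Split Document" sample it creates an empty output document and carries over the PDF version and some metadata from the input; exactly which Info fields to copy is a judgment call, so treat this as a sketch:

```csharp
using PdfSharp.Pdf;

// Creates an empty output document that inherits the version and basic
// metadata of the input, as in the PDFsharp "Split Document" sample.
static PdfDocument CreateDocument(PdfDocument input)
{
    var output = new PdfDocument();
    output.Version = input.Version;
    output.Info.Title = input.Info.Title;
    output.Info.Creator = input.Info.Creator;
    return output;
}
```

With that in place, the method can be invoked as, for example, SplitBySize("input.pdf", 10_000_000) for a 10 MB limit per part.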
This is untested sample code, assuming you are prepared to split at the purely binary level, i.e. the parts won't be readable by a PDF reader, and you will have to rejoin them to make the document readable again:
The code below first reads the PDF file into a byte[] array. Then, based on an arbitrary partition count (5 in this example), it computes the size of each part. It then wraps the bytes in a MemoryStream and loops through it, writing each partition to a new .part file.
(You might need to make some changes to make this workable.)
byte[] pdfBytes = File.ReadAllBytes(@"c:\foo.pdf");
int partSize = pdfBytes.Length / 5; //assuming foo is 40MB, each part will be about 8MB
MemoryStream m = new MemoryStream(pdfBytes);
for (int i = 0; i < 5; i++)
{
    //the last part also picks up any remainder left by the integer division
    int size = (i < 4) ? partSize : pdfBytes.Length - 4 * partSize;
    byte[] tbytes = new byte[size];
    m.Read(tbytes, 0, size); //offset is into the buffer; the stream position advances on its own
    File.WriteAllBytes(@"C:\foo" + i + ".part", tbytes);
}