Handling large Excel files with shared strings


Using OpenXML, Microsoft recommends using the SAX approach:
https://msdn.microsoft.com/en-us/library/office/gg575571.aspx
So rather than loading the whole document DOM in memory, you can read the file serially with OpenXmlReader. For example:
WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);
string text;
while (reader.Read())
{
    if (reader.ElementType == typeof(CellValue))
    {
        text = reader.GetText();
        Console.Write(text + " ");
    }
}
However, this breaks down for cells with the SharedString data type. Their text is stored separately from the sheet data, in the shared string table, and as far as I can see there is no way to avoid loading the entire table. For example, I can do this:
var sharedStrings = wbPart.SharedStringTablePart.SharedStringTable.Cast<SharedStringItem>()
    .Select(i => i.Text.Text).ToArray();
And then I can do something like:
var row = reader.LoadCurrentElement() as Row;
var cells = row.Descendants<Cell>();
var cellValues = cells.Select(c => c.DataType != null
    && c.DataType == CellValues.SharedString ?
    sharedStrings[int.Parse(c.CellValue.Text)] : c.CellValue.Text).ToArray();
This works, but I had to load the entire shared string table, which could be very large if the file has a lot of unique strings. Is there a more efficient way to handle looking up the shared strings as you process each row of the file?
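One option that may reduce the overhead is to read the shared string table itself with OpenXmlReader, so that only one item at a time is materialized as a DOM element. The following is a minimal sketch, not a definitive answer: the class and method names (SharedStringStreaming, LoadSharedStrings) are made up for illustration, and the strings themselves are still kept in memory, so only the cost of the full DOM is avoided.

```csharp
using System.Collections.Generic;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

public static class SharedStringStreaming
{
    // Hypothetical helper (not part of the SDK): reads the shared string
    // table serially and returns the plain text of each item by index.
    public static List<string> LoadSharedStrings(WorkbookPart workbookPart)
    {
        var strings = new List<string>();
        SharedStringTablePart tablePart = workbookPart.SharedStringTablePart;
        if (tablePart == null)
        {
            return strings; // the workbook has no shared string table
        }

        using (OpenXmlReader reader = OpenXmlReader.Create(tablePart))
        {
            while (reader.Read())
            {
                if (reader.ElementType == typeof(SharedStringItem) && reader.IsStartElement)
                {
                    // Only this single item is loaded as a DOM fragment.
                    var item = (SharedStringItem)reader.LoadCurrentElement();
                    // Plain strings use <t>; rich-text items fall back to
                    // InnerText, which concatenates the text of all runs.
                    strings.Add(item.Text != null ? item.Text.Text : item.InnerText);
                }
            }
        }
        return strings;
    }
}
```

The returned list can then replace the sharedStrings array in the row-processing code above, for example sharedStrings[int.Parse(c.CellValue.Text)].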

Related

Parsing cell value of Excel spreadsheet

I am parsing the cell at address A2. The code returns the value 3 instead of the expected Category 1.
test.xlsx
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using X = DocumentFormat.OpenXml.Spreadsheet;
namespace DotNetSandbox.SO
{
    public class IncorrectCellValue
    {
        public static void ParseCellValue()
        {
            using SpreadsheetDocument doc = SpreadsheetDocument.Open(@"c:\temp\test.xlsx", false);
            X.Sheet sheet = doc.WorkbookPart.Workbook.Descendants<X.Sheet>().First();
            WorksheetPart wsPart = (WorksheetPart)doc.WorkbookPart.GetPartById(sheet.Id);
            X.Cell cell = wsPart.Worksheet.Descendants<X.Cell>().First(c => c.CellReference == "A2");
            string cellValue = cell.CellValue.Text;
            Console.WriteLine(cellValue);
            Console.ReadKey();
        }
    }
}
OUTPUT:
3
Target: .NET 5
DocumentFormat.OpenXml version: 2.13.0
Am I doing something wrong, or is this a library bug?

Use this method:
// Requires: using System; using System.Linq;
// using DocumentFormat.OpenXml.Packaging; using DocumentFormat.OpenXml.Spreadsheet;
public static string GetCellValue(string fileName,
    string addressName, string sheetName = "")
{
    string value = null;
    // Open the spreadsheet document for read-only access.
    using (SpreadsheetDocument document =
        SpreadsheetDocument.Open(fileName, false))
    {
        // Retrieve a reference to the workbook part.
        WorkbookPart wbPart = document.WorkbookPart;
        // Find the sheet with the supplied name, or fall back to the
        // first sheet when no name is supplied.
        var theSheets = wbPart.Workbook.Descendants<Sheet>();
        Sheet theSheet = string.IsNullOrEmpty(sheetName) ? theSheets.FirstOrDefault() : theSheets.FirstOrDefault(x => x.Name == sheetName);
        // Throw an exception if there is no sheet.
        if (theSheet == null)
        {
            throw new ArgumentException("sheetName");
        }
        // Retrieve a reference to the worksheet part.
        WorksheetPart wsPart =
            (WorksheetPart)(wbPart.GetPartById(theSheet.Id));
        // Use its Worksheet property to get a reference to the cell
        // whose address matches the address you supplied.
        Cell theCell = wsPart.Worksheet.Descendants<Cell>().
            Where(c => c.CellReference == addressName).FirstOrDefault();
        // If the cell does not exist or is empty, value stays null.
        if (theCell != null && theCell.InnerText.Length > 0)
        {
            value = theCell.InnerText;
            // If the cell represents an integer number, you are done.
            // For dates, this code returns the serialized value that
            // represents the date. The code handles strings and
            // Booleans individually. For shared strings, the code
            // looks up the corresponding value in the shared string
            // table. For Booleans, the code converts the value into
            // the words TRUE or FALSE.
            if (theCell.DataType != null)
            {
                switch (theCell.DataType.Value)
                {
                    case CellValues.SharedString:
                        // For shared strings, look up the value in the
                        // shared strings table.
                        var stringTable =
                            wbPart.GetPartsOfType<SharedStringTablePart>()
                            .FirstOrDefault();
                        // If the shared string table is missing, something
                        // is wrong. Return the index that is in
                        // the cell. Otherwise, look up the correct text in
                        // the table.
                        if (stringTable != null)
                        {
                            value =
                                stringTable.SharedStringTable
                                .ElementAt(int.Parse(value)).InnerText;
                        }
                        break;
                    case CellValues.Boolean:
                        switch (value)
                        {
                            case "0":
                                value = "FALSE";
                                break;
                            default:
                                value = "TRUE";
                                break;
                        }
                        break;
                }
            }
        }
    }
    return value;
}
The part you missed is described in this comment from the method:
If the cell represents an integer number, you are done.
For dates, this code returns the serialized value that
represents the date. The code handles strings and
Booleans individually. For shared strings, the code
looks up the corresponding value in the shared string
table. For Booleans, the code converts the value into
the words TRUE or FALSE.
I was able to get Category 1 by running this code:
var cellValue = GetCellValue(@"c:\test.xlsx", "A2");
Microsoft Doc
Notice that I changed the original method to get the first sheet if you do not pass the sheet name to the method.
What is a shared string:
To optimize the use of strings in a spreadsheet, SpreadsheetML stores a single instance of the string in a table called the shared string table. The cells then reference the string by index instead of storing the value inline in the cell value. Excel always creates a shared string table when it saves a file.

How to uppercase all the words in excel spreadsheet using NPOI C#

I need to convert all the words contained in a sheet to uppercase using NPOI in C#, but I can't find a method to do this.
Before applying uppercase : Cell[1;1]=stackoverflow
After applying uppercase : Cell[1;1]=STACKOVERFLOW

I don't think it is possible using NPOI without looping through the cells. It can probably be done through Excel Interop, since that lets you select a range in the file and perform actions on it (as in Excel), but NPOI doesn't offer such an ability.
However, you don't need to loop through every cell in the sheet: the FirstRowNum and LastRowNum properties give you the range of rows that actually contain data.
So your loop could look like this (converting all strings in the first worksheet of the file to uppercase):
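// Requires: using System.IO; using System.Linq;
// using NPOI.HSSF.UserModel; using NPOI.SS.UserModel;
HSSFWorkbook hssfwb; // "var" cannot be used without an initializer
using (var file = new FileStream(@"your_file.xls", FileMode.Open, FileAccess.Read))
{
    // The using block closes the stream, so no explicit Close() is needed.
    hssfwb = new HSSFWorkbook(file);
}
var sheet = hssfwb.GetSheetAt(0);
for (int i = sheet.FirstRowNum; i <= sheet.LastRowNum; i++)
{
    var row = sheet.GetRow(i);
    if (row != null)
    {
        foreach (ICell cell in row.Cells.Where(c => c.CellType == CellType.String))
            cell.SetCellValue(cell.StringCellValue.ToUpper());
    }
}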

Reading an uploaded Excel file

I am building a quick proof-of-concept project to show the ability to parse an Excel file. Based on Microsoft documentation (How to parse and read a large spreadsheet document), it seems the SDK of choice is Open XML.
The proof of concept gives a basic form for a user to upload a file (.xlsx). The controller reads the file and spits back the content. I am struggling to grab the value of each cell; instead, it seems I am only able to get some sort of identifier or reference to the text. Here is my code with some examples:
View
@using (Html.BeginForm("Index", "Home", FormMethod.Post, new { enctype = "multipart/form-data" }))
{
    <input type="file" name="file"/>
    <input type="submit"/>
}
<br/>
@Html.Raw(ViewBag.Text)
Action
[HttpPost]
public ActionResult Index(HttpPostedFileBase file)
{
    ViewBag.Text = "";
    using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(file.InputStream, false))
    {
        WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
        WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
        SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First();
        string text;
        foreach (Row r in sheetData.Elements<Row>())
        {
            foreach (Cell c in r.Elements<Cell>())
            {
                text = c.CellValue.Text;
                ViewBag.Text += text + ", ";
            }
            ViewBag.Text += "<br />";
        }
    }
    return this.View();
}
Excel File
| Hello | World |
| World | Hello |
Output
0, 1,
1, 0,
As you can see, the 0 represents "Hello" and the 1 represents "World". I've tested this with a larger data set and have confirmed that identical words have the same value when printed to the screen.
The example is pretty much copy/pasted from the MS website. I've tried accessing other c.CellValue properties, such as InnerText and InnerXml, only to get the same results. What am I doing wrong? Is Open XML a good SDK to use for this purpose?

Use this method to get the exact value of the cell instead of numbers:
private string ReadExcelCell(Cell cell, WorkbookPart workbookPart)
{
    var cellValue = cell.CellValue;
    var text = (cellValue == null) ? cell.InnerText : cellValue.Text;
    if ((cell.DataType != null) && (cell.DataType == CellValues.SharedString))
    {
        text = workbookPart.SharedStringTablePart.SharedStringTable
            .Elements<SharedStringItem>().ElementAt(
                Convert.ToInt32(cell.CellValue.Text)).InnerText;
    }
    return (text ?? string.Empty).Trim();
}
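Note that ElementAt walks the shared string table from the start on every call, which adds up when many cells are read. A possible variation, sketched here under the assumption that the table fits in memory, reads the strings into a list once and then looks them up by index. The class name CellTextReader is hypothetical and not part of the SDK.

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

// Hypothetical wrapper: caches the shared strings once per workbook.
public sealed class CellTextReader
{
    private readonly List<string> _sharedStrings;

    public CellTextReader(WorkbookPart workbookPart)
    {
        SharedStringTablePart part = workbookPart.SharedStringTablePart;
        // A workbook without text cells may have no shared string table.
        _sharedStrings = part == null
            ? new List<string>()
            : part.SharedStringTable.Elements<SharedStringItem>()
                  .Select(item => item.InnerText)
                  .ToList();
    }

    public string Read(Cell cell)
    {
        string text = cell.CellValue != null ? cell.CellValue.Text : cell.InnerText;

        // For shared strings the cell holds an index into the table.
        if (cell.DataType != null
            && cell.DataType.Value == CellValues.SharedString
            && int.TryParse(text, out int index)
            && index >= 0 && index < _sharedStrings.Count)
        {
            text = _sharedStrings[index];
        }
        return (text ?? string.Empty).Trim();
    }
}
```

In the controller above, one instance would be created per uploaded workbook (new CellTextReader(workbookPart)) and reader.Read(c) would replace c.CellValue.Text inside the inner loop.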

Get Tables (workparts) of a sheet of excel by OpenXML SDK

I have 3 tables in one sheet of an Excel file,
and I use the OpenXML SDK to read the Excel file, like this:
SpreadsheetDocument document = SpreadsheetDocument.Open(/* read it */);
foreach (Sheet sheet in document.WorkbookPart.Workbook.Sheets)
{
    // I need each table or work part of the sheet here
}
As you can see, I can get each sheet of the Excel file, but how can I get the work parts in each sheet, such as my 3 tables? I should be able to iterate over these tables. Does anyone know how to do this? Any suggestions?

Does this help?
// true for editable
using (SpreadsheetDocument xl = SpreadsheetDocument.Open("yourfile.xlsx", true))
{
    foreach (WorksheetPart wsp in xl.WorkbookPart.WorksheetParts)
    {
        foreach (TableDefinitionPart tdp in wsp.TableDefinitionParts)
        {
            // for example
            // tdp.Table.AutoFilter = new AutoFilter() { Reference = "B2:D3" };
        }
    }
}
Note that the actual cell data is not in the Table object, but in SheetData (under Worksheet of the WorksheetPart). Just so you know.

You can get a specific table from Excel. Adding more to the answer of @Vincent:
using (SpreadsheetDocument document = SpreadsheetDocument.Open("yourfile.xlsx", true))
{
    var workbookPart = document.WorkbookPart;
    // Compare names case-insensitively. Uppercasing one side and comparing
    // it with a mixed-case literal would never match.
    var relationshipId = workbookPart.Workbook.Descendants<Sheet>()
        .FirstOrDefault(s => string.Equals(s.Name.Value.Trim(), "your sheetName",
            StringComparison.OrdinalIgnoreCase))?.Id;
    var worksheetPart = (WorksheetPart)workbookPart.GetPartById(relationshipId);
    TableDefinitionPart tableDefinitionPart = worksheetPart.TableDefinitionParts
        .FirstOrDefault(r =>
            string.Equals(r.Table.Name.Value, "your Table Name",
                StringComparison.OrdinalIgnoreCase));
    QueryTablePart queryTablePart = tableDefinitionPart.QueryTableParts.FirstOrDefault();
    Table excelTable = tableDefinitionPart.Table;
    var newCellRange = excelTable.Reference;
    var startCell = newCellRange.Value.Split(':')[0]; // use your own logic to derive the row and column from these values
    var endCell = newCellRange.Value.Split(':')[1]; // then use them to extract the values with regular Open XML code
}
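The "own logic" for turning startCell and endCell into row and column numbers could look like the sketch below. It assumes plain A1-style references without $ markers (such as "B2:D3"). The class and method names are made up for this example.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for A1-style references such as "B2" or "B2:D3".
public static class CellReferenceParser
{
    // Splits "B2" into a 1-based column index (A = 1, Z = 26, AA = 27)
    // and the row number.
    public static (int Column, int Row) Parse(string cellReference)
    {
        if (string.IsNullOrEmpty(cellReference))
        {
            throw new ArgumentException("Empty cell reference.", nameof(cellReference));
        }

        int column = 0;
        int i = 0;
        while (i < cellReference.Length)
        {
            char c = char.ToUpperInvariant(cellReference[i]);
            if (c < 'A' || c > 'Z')
            {
                break; // the letters end where the row digits start
            }
            column = column * 26 + (c - 'A' + 1);
            i++;
        }

        if (i == 0 || i == cellReference.Length)
        {
            throw new FormatException("Expected letters followed by digits: " + cellReference);
        }

        int row = int.Parse(cellReference.Substring(i), CultureInfo.InvariantCulture);
        return (column, row);
    }

    // Splits "B2:D3" into its start and end cells. A single cell such as
    // "B2" is treated as a one-cell range.
    public static ((int Column, int Row) Start, (int Column, int Row) End) ParseRange(string reference)
    {
        string[] parts = reference.Split(':');
        var start = Parse(parts[0]);
        var end = parts.Length > 1 ? Parse(parts[1]) : start;
        return (start, end);
    }
}
```

For example, CellReferenceParser.ParseRange("B2:D3") yields columns 2 through 4 and rows 2 through 3, which can then be used to filter the cells in SheetData.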

OpenXML (SAX Method) - Adding row to existing tab

I am trying to create an Excel document using OpenXML (SAX method). When my method is called, I want to check whether a tab has already been created for a given key. If it has, I would like to just append a row to the bottom of that tab. If no tab has been created for the key yet, I create a new one like this:
part = wbPart.AddNewPart<WorksheetPart>();
string worksheetName = row.Key[i].ToString();
Sheet sheet = new Sheet() { Id = document.WorkbookPart.GetIdOfPart(part), SheetId = sheetNumber, Name = worksheetName };
sheets.Append(sheet);
writer = OpenXmlWriter.Create(part);
writer.WriteStartElement(new Worksheet());
writer.WriteStartElement(new SheetData());
currentrow = 1;
string header = Header + "\t" + wrapper.GetHeaderString(3, 2, -1); //need to fix
WriteDataToExcel(header, currentrow, 0, writer);
currentrow++;
writer.WriteEndElement();
writer.WriteEndElement();
writer.Close();
If a tab has already been created, I retrieve the sheet using the following code:
private static WorksheetPart GetWorksheetPartByName(SpreadsheetDocument document, string sheetName)
{
    IEnumerable<Sheet> sheets =
        document.WorkbookPart.Workbook.GetFirstChild<Sheets>().
        Elements<Sheet>().Where(s => s.Name == sheetName);
    if (sheets.Count() == 0)
    {
        // The specified worksheet does not exist.
        return null;
    }
    string relationshipId = sheets.First().Id.Value;
    WorksheetPart worksheetPart = (WorksheetPart)
        document.WorkbookPart.GetPartById(relationshipId);
    return worksheetPart;
}
When the correct worksheet part is returned, I try to add the new row by pointing my OpenXmlWriter at that part and then adding the row:
part = GetWorksheetPartByName(document, row.Key[i].ToString());
writer = OpenXmlWriter.Create(part);
writer.WriteStartElement(part.Worksheet);
writer.WriteStartElement(part.Worksheet.GetFirstChild<SheetData>());
SheetData sheetData = part.Worksheet.GetFirstChild<SheetData>();
Row lastRow = sheetData.Elements<Row>().LastOrDefault();
The code runs; however, I always end up with just one row (the initial one I added when first creating the tab). No subsequent rows show up in the spreadsheet.
I will be adding a lot of rows (50,000+) and would prefer not to have to create a new file and copy the information over each time.

From my experience, using the SAX method to write (ie, with OpenXmlWriter) works best for new things (parts, worksheets, whatnot). When you use OpenXmlWriter.Create(), that's like overwriting the original existing data for the part (WorksheetPart in this case). Even though in effect, it's not. It's complicated.
As far as my experiments went, if there's existing data, you can't edit data using OpenXmlWriter. Not even if you use the Save() function or close the OpenXmlWriter correctly. For some reason, the SDK will ignore your efforts. Hence the original one row that you added.
If you're writing 50,000 rows, it's best to do so all in one go. Then the SAX method will be useful. Besides, if you're writing one row (at a time?), the speed benefits of using SAX versus the DOM method are negligible.

According to this site (working with an existing Excel file with OpenXmlWriter):
OpenXmlWriter can only operate on a new worksheet, not on an existing document. So I'm afraid you cannot insert values into particular cells of an existing spreadsheet using OpenXmlWriter.
You could read all the data in your existing Excel file. Then, since you need to add many rows (50,000+), I recommend using OpenXmlWriter to write the old and new data to a new Excel file in one pass. If you use the DOM approach, appending that many rows (50,000+) might cause memory problems.
