I am building a quick proof-of-concept project to show the ability to parse an Excel file. Based on Microsoft documentation (How to parse and read a large spreadsheet document), it seems the SDK of choice is Open XML.
The proof of concept gives the user a basic form to upload an .xlsx file. The controller reads the file and echoes its content back. I am struggling to get the value of each cell; instead, I only seem to get some sort of identifier or reference to the text. Here is my code with some examples:
View
@using(Html.BeginForm("Index", "Home", FormMethod.Post, new{ enctype="multipart/form-data" } ))
{
<input type="file" name="file"/>
<input type="submit"/>
}
<br/>
@Html.Raw(ViewBag.Text)
Action
[HttpPost]
public ActionResult Index(HttpPostedFileBase file)
{
ViewBag.Text = "";
using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(file.InputStream, false))
{
WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First();
string text;
foreach (Row r in sheetData.Elements<Row>())
{
foreach (Cell c in r.Elements<Cell>())
{
text = c.CellValue.Text;
ViewBag.Text += text + ", ";
}
ViewBag.Text += "<br />";
}
}
return this.View();
}
Excel File
| Hello | World |
| World | Hello |
Output
0, 1,
1, 0,
As you can see, the 0 represents "Hello" and the 1 represents "World". I've tested this with a larger data set and have confirmed that identical words have the same value when printed to the screen.
The example is pretty much copy/pasted from the MS website. I've tried accessing other c.CellValue properties, such as InnerText and InnerXml, only to get the same results. What am I doing wrong? Is Open XML a good SDK to use for this purpose?
Use this method to get the exact value of the cell instead of numbers:
private string ReadExcelCell(Cell cell, WorkbookPart workbookPart)
{
var cellValue = cell.CellValue;
var text = (cellValue == null) ? cell.InnerText : cellValue.Text;
if ((cell.DataType != null) && (cell.DataType == CellValues.SharedString))
{
text = workbookPart.SharedStringTablePart.SharedStringTable
.Elements<SharedStringItem>().ElementAt(
Convert.ToInt32(cell.CellValue.Text)).InnerText;
}
return (text ?? string.Empty).Trim();
}
Related
I am parsing the cell at address A2. This returns 3 instead of the expected value, Category 1.
test.xlsx
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using X = DocumentFormat.OpenXml.Spreadsheet;
namespace DotNetSandbox.SO
{
public class IncorrectCellValue
{
public static void ParseCellValue()
{
using SpreadsheetDocument doc = SpreadsheetDocument.Open(@"c:\temp\test.xlsx", false);
X.Sheet sheet = doc.WorkbookPart.Workbook.Descendants<X.Sheet>().First();
WorksheetPart wsPart = (WorksheetPart)doc.WorkbookPart.GetPartById(sheet.Id);
X.Cell cell = wsPart.Worksheet.Descendants<X.Cell>().First(c => c.CellReference == "A2");
string cellValue = cell.CellValue.Text;
Console.WriteLine(cellValue);
Console.ReadKey();
}
}
}
OUTPUT:
3
Target: .NET 5
DocumentFormat.OpenXml version: 2.13.0
Am I doing something wrong, or is this a library bug?
Use this method
public static string GetCellValue(string fileName,
string addressName, string sheetName = "")
{
string value = null;
// Open the spreadsheet document for read-only access.
using (SpreadsheetDocument document =
SpreadsheetDocument.Open(fileName, false))
{
// Retrieve a reference to the workbook part.
WorkbookPart wbPart = document.WorkbookPart;
// Find the sheet with the supplied name, or fall back to the first
// sheet when no name is given.
var theSheets = wbPart.Workbook.Descendants<Sheet>();
Sheet theSheet = string.IsNullOrEmpty(sheetName) ? theSheets.FirstOrDefault() : theSheets.FirstOrDefault(x => x.Name == sheetName);
// Throw an exception if there is no sheet.
if (theSheet == null)
{
throw new ArgumentException("sheetName");
}
// Retrieve a reference to the worksheet part.
WorksheetPart wsPart =
(WorksheetPart)(wbPart.GetPartById(theSheet.Id));
// Use its Worksheet property to get a reference to the cell
// whose address matches the address you supplied.
Cell theCell = wsPart.Worksheet.Descendants<Cell>().
Where(c => c.CellReference == addressName).FirstOrDefault();
// If the cell does not exist or is empty, value stays null.
if (theCell != null && theCell.InnerText.Length > 0)
{
value = theCell.InnerText;
// If the cell represents an integer number, you are done.
// For dates, this code returns the serialized value that
// represents the date. The code handles strings and
// Booleans individually. For shared strings, the code
// looks up the corresponding value in the shared string
// table. For Booleans, the code converts the value into
// the words TRUE or FALSE.
if (theCell.DataType != null)
{
switch (theCell.DataType.Value)
{
case CellValues.SharedString:
// For shared strings, look up the value in the
// shared strings table.
var stringTable =
wbPart.GetPartsOfType<SharedStringTablePart>()
.FirstOrDefault();
// If the shared string table is missing, something
// is wrong. Return the index that is in
// the cell. Otherwise, look up the correct text in
// the table.
if (stringTable != null)
{
value =
stringTable.SharedStringTable
.ElementAt(int.Parse(value)).InnerText;
}
break;
case CellValues.Boolean:
switch (value)
{
case "0":
value = "FALSE";
break;
default:
value = "TRUE";
break;
}
break;
}
}
}
}
return value;
}
You are stuck at this point, as the comments in the code explain:
If the cell represents an integer number, you are done.
For dates, this code returns the serialized value that
represents the date. The code handles strings and
Booleans individually. For shared strings, the code
looks up the corresponding value in the shared string
table. For Booleans, the code converts the value into
the words TRUE or FALSE.
I was able to get Category 1 by running this code:
var cellValue = GetCellValue(@"c:\test.xlsx", "A2");
Microsoft Doc
Notice that I changed the original method to get the first sheet if you do not pass the sheet name to the method.
What is Shared String:
To optimize the use of strings in a spreadsheet, SpreadsheetML stores a single instance of the string in a table called the shared string table. The cells then reference the string by index instead of storing the value inline in the cell value. Excel always creates a shared string table when it saves a file.
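The index lookup described above can be sketched in memory, without opening a file. This is only an illustration: the class name SharedStringDemo, the helper Resolve, and the two-entry table are made up for the example, and it assumes the DocumentFormat.OpenXml package is referenced.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Spreadsheet;

// Hypothetical demo class; not part of the Open XML SDK.
public static class SharedStringDemo
{
    // Returns the display text of a cell, following the shared string
    // index when the cell's data type says it is a shared string.
    public static string Resolve(Cell cell, SharedStringTable table)
    {
        string raw = cell.CellValue?.Text ?? string.Empty;
        if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
        {
            // raw is an index into the table, not the text itself.
            return table.Elements<SharedStringItem>()
                        .ElementAt(int.Parse(raw))
                        .InnerText;
        }
        return raw;
    }

    public static void Main()
    {
        // A tiny table standing in for the workbook's shared string part.
        var table = new SharedStringTable(
            new SharedStringItem(new Text("Hello")),
            new SharedStringItem(new Text("World")));

        // A cell that stores the index 1 instead of the text.
        var cell = new Cell
        {
            DataType = CellValues.SharedString,
            CellValue = new CellValue("1")
        };

        Console.WriteLine(Resolve(cell, table));
    }
}
```

With this table, index 0 stands for Hello and index 1 for World, which matches the 0/1 output seen when CellValue.Text is printed directly.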
I am using the NPOI library to read .xls and .xlsx files.
However, I have an issue: the method GetRow() does not return null even when the row is empty.
Here is the code
int idx_row = 1;
IRow currentRow = sheet.GetRow(idx_row);
while (currentRow != null)
{
JObject jsonData = new JObject();
jsonData["a"] = sheet.GetRow(idx_row).GetCell(0).StringCellValue.Replace(" ", "");
// other similar code
jsonPlateData.Add(jsonData);
idx_row++;
currentRow = sheet.GetRow(idx_row);
}
Check the value of sheet.LastRowNum. The row may look empty while Excel still considers it filled. If so, open the Excel file and delete the rows that appear empty.
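If editing the file by hand is not an option, the loop itself can be made tolerant of such rows. The following is a rough sketch, assuming the NPOI package is referenced; the class name NpoiRows and the method ReadFirstColumn are made up for illustration. It bounds the loop with LastRowNum and skips rows or cells that are missing or blank instead of stopping at the first null row.

```csharp
using System.Collections.Generic;
using NPOI.SS.UserModel;

// Hypothetical helper; not part of NPOI.
public static class NpoiRows
{
    // Collects the first-column text of every non-empty row,
    // starting at firstRow (0-based, so 1 skips a header row).
    public static List<string> ReadFirstColumn(ISheet sheet, int firstRow = 1)
    {
        var values = new List<string>();
        var formatter = new DataFormatter();

        for (int i = firstRow; i <= sheet.LastRowNum; i++)
        {
            IRow row = sheet.GetRow(i);
            if (row == null) continue;            // row was never created

            ICell cell = row.GetCell(0);
            if (cell == null || cell.CellType == CellType.Blank)
                continue;                         // row exists but has no data here

            // FormatCellValue also copes with numeric cells, where
            // StringCellValue would throw.
            string text = formatter.FormatCellValue(cell).Replace(" ", "");
            if (text.Length > 0) values.Add(text);
        }
        return values;
    }
}
```

Skipping rather than breaking means a stray empty row in the middle of the data does not cut the import short; whether that is the desired behaviour depends on the file.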
Using OpenXML, Microsoft recommends using the SAX approach:
https://msdn.microsoft.com/en-us/library/office/gg575571.aspx
So rather than loading the whole document DOM in memory, you can read the file serially with OpenXmlReader. For example:
WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);
string text;
while (reader.Read())
{
if (reader.ElementType == typeof(CellValue))
{
text = reader.GetText();
Console.Write(text + " ");
}
}
But this kinda falls down when you have cells with the SharedString data type. Those are stored separately from the sheet data, in the shared string table, and, as far as I can see, there's no real way to avoid having to load the entire shared string table. For example, I can do this:
var sharedStrings = wbPart.SharedStringTablePart.SharedStringTable.Cast<SharedStringItem>()
.Select(i => i.Text.Text).ToArray();
And then I can do something like:
var row = reader.LoadCurrentElement() as Row;
var cells = row.Descendants<Cell>();
var cellValues = cells.Select(c => c.DataType != null
&& c.DataType == CellValues.SharedString ?
sharedStrings[int.Parse(c.CellValue.Text)] : c.CellValue.Text).ToArray();
Which works, but I had to load the entire shared string table, which could be very large if the file has a lot of unique strings. Is there a more efficient way to handle looking up the shared strings as you process each row of the file?
I am using C# in my project. I have a long XML file and want to export all of its records to a CSV file at once. I am trying to do this with the code below, but the columns are misaligned: values end up in the wrong columns. I then noticed that for some attributes (for example, Note), the text contains semicolons instead of commas, so it is split across three columns instead of one.
Example: "Review VAT query; draft simple VAT agreement; review law and reply to queries". How can I ignore the semicolons in those values?
Here is my code.
var output = new StringBuilder();
output.AppendLine("EmployeeId;EmployeeFirstName;EmployeeLastName;AllocationId;TaskId;TaskName;ProjectName;CustomerName;InvoiceAmount;WorkHours");
if (workUnit != null)
{
foreach (XmlNode customer in workUnit)
{
var unit = new WorkUnit();
var childNodes = customer.SelectNodes("./*");
if (childNodes != null)
{
for (int i = 0; i < childNodes.Count; ++i)
{
XmlNode childNode = childNodes[i];
output.Append(childNode.InnerText);
if (i < childNodes.Count - 1)
output.Append(";");
}
}
output.Append(Environment.NewLine);
}
Console.WriteLine(output.ToString());
File.AppendAllText("c:\\..WorkUnits.csv", output.ToString());
}
You could try to use the StringToCSVCell method defined by @Ed Bayiates here to escape any semicolons in the cell values:
escaping tricky string to CSV format
XmlNode childNode = childNodes[i];
output.Append(StringToCSVCell(childNode.InnerText));
if (i < childNodes.Count - 1)
output.Append(";");
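The linked StringToCSVCell method is not reproduced in this thread. As a rough stand-in, a quoting helper along these lines would follow the usual CSV convention: wrap a field in double quotes when it contains the delimiter, a quote, or a line break, and double any embedded quotes. The names CsvEscaping and EscapeCsvField are made up for this sketch, and the semicolon default simply matches the delimiter used in the question.

```csharp
using System;

// Hypothetical helper; not the method from the linked answer.
public static class CsvEscaping
{
    // Quotes the field only when needed, so plain values stay unchanged.
    public static string EscapeCsvField(string value, char delimiter = ';')
    {
        if (value == null) return string.Empty;

        bool mustQuote = value.IndexOf(delimiter) >= 0
            || value.IndexOf('"') >= 0
            || value.IndexOf('\n') >= 0
            || value.IndexOf('\r') >= 0;

        if (!mustQuote) return value;

        // Embedded quotes are doubled, then the whole field is wrapped.
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }

    public static void Main()
    {
        Console.WriteLine(EscapeCsvField("Review VAT query; draft simple VAT agreement"));
        Console.WriteLine(EscapeCsvField("plain value"));
    }
}
```

A consumer that honours quoted fields should then treat the whole note as one column, even though it contains semicolons.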
I have 3 tables in one sheet of an Excel file,
and I use the Open XML SDK to read the Excel file, like this:
SpreadsheetDocument document = SpreadsheetDocument.Open(/*read it*/);
foreach(Sheet sheet in document.WorkbookPart.Workbook.Sheets)
{
//I need each table or work part of sheet here
}
As you can see, I can get each sheet of the Excel file, but how can I get the work parts within each sheet, such as my 3 tables? I should be able to iterate over these tables. Does anyone know how to do this? Any suggestions?
Does this help?
// true for editable
using (SpreadsheetDocument xl = SpreadsheetDocument.Open("yourfile.xlsx", true))
{
foreach (WorksheetPart wsp in xl.WorkbookPart.WorksheetParts)
{
foreach (TableDefinitionPart tdp in wsp.TableDefinitionParts)
{
// for example
// tdp.Table.AutoFilter = new AutoFilter() { Reference = "B2:D3" };
}
}
}
Note that the actual cell data is not in the Table object, but in SheetData (under Worksheet of the WorksheetPart). Just so you know.
You can get a specific table from the Excel file. Adding to the answer from @Vincent:
using (SpreadsheetDocument document= SpreadsheetDocument.Open("yourfile.xlsx", true))
{
var workbookPart = document.WorkbookPart;
// Compare names case-insensitively; upper-casing only one side never matches a mixed-case literal.
var relationsShipId = workbookPart.Workbook.Descendants<Sheet>()
.FirstOrDefault(s => string.Equals(s.Name.Value.Trim(), "your sheetName", StringComparison.OrdinalIgnoreCase))?.Id;
var worksheetPart = (WorksheetPart)workbookPart.GetPartById(relationsShipId);
TableDefinitionPart tableDefinitionPart = worksheetPart.TableDefinitionParts
.FirstOrDefault(r =>
string.Equals(r.Table.Name.Value, "your Table Name", StringComparison.OrdinalIgnoreCase));
QueryTablePart queryTablePart = tableDefinitionPart.QueryTableParts.FirstOrDefault();
Table excelTable = tableDefinitionPart.Table;
var newCellRange = excelTable.Reference;
var startCell = newCellRange.Value.Split(':')[0]; // you can have your own logic to find out row and column with this values
var endCell = newCellRange.Value.Split(':')[1];// Then you can use them to extract values using regular open xml
}
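The "own logic" mentioned in the comments could look roughly like the sketch below, which turns a reference such as B2 into 1-based column and row numbers. The names CellRefs, Parse, and ParseRange are made up for illustration, and the sketch assumes plain references without $ signs or sheet prefixes.

```csharp
using System;
using System.Linq;

// Hypothetical helper; not part of the Open XML SDK.
public static class CellRefs
{
    // "B2" -> (Column: 2, Row: 2); "AA10" -> (Column: 27, Row: 10).
    public static (int Column, int Row) Parse(string cellReference)
    {
        string letters = new string(
            cellReference.TakeWhile(c => char.IsLetter(c)).ToArray());
        string digits = cellReference.Substring(letters.Length);

        // Column letters are a base-26 number where A = 1.
        int column = 0;
        foreach (char ch in letters.ToUpperInvariant())
        {
            column = column * 26 + (ch - 'A' + 1);
        }
        return (column, int.Parse(digits));
    }

    // "B2:D3" -> start and end corners of the table range.
    public static ((int Column, int Row) Start, (int Column, int Row) End)
        ParseRange(string reference)
    {
        string[] parts = reference.Split(':');
        return (Parse(parts[0]), Parse(parts[1]));
    }

    public static void Main()
    {
        var range = ParseRange("B2:D3");
        Console.WriteLine($"columns {range.Start.Column}..{range.End.Column}, " +
                          $"rows {range.Start.Row}..{range.End.Row}");
    }
}
```

For the range B2:D3, the start corner maps to column 2, row 2 and the end corner to column 4, row 3. Those bounds can then be used to filter the cells in SheetData, since the Table object itself does not hold the cell data.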