Excel xlsx file parsing - using Koogra - C#

I tried several packages from GitHub to parse a fairly large Excel document, and every one of them threw an out-of-memory exception.
After some more searching I found a GNU library named Koogra, which seems to be the only one fit for the job. I did not search much further, as I am running out of time for this part of the project.
The code I have now gets past the out-of-memory issue, so the only thing left is how to parse the Excel document properly, so that I can extract something like a dictionary in which one column is the key and another column is the value.
This is the code I have so far:
var path = Path.Combine(Environment.CurrentDirectory, "tst.xlsx");
Net.SourceForge.Koogra.Excel2007.Workbook xcel = new Net.SourceForge.Koogra.Excel2007.Workbook(path);
var ss = xcel.GetWorksheets();

I found the answer after some more searching.
The first line of the code opens an Excel 2007 (xlsx) file;
the commented-out second line is for the older xls format.
Net.SourceForge.Koogra.IWorkbook genericWB = Net.SourceForge.Koogra.WorkbookFactory.GetExcel2007Reader("tst.xlsx");
//genericWB = Net.SourceForge.Koogra.WorkbookFactory.GetExcelBIFFReader("some.xls");
Net.SourceForge.Koogra.IWorksheet genericWS = genericWB.Worksheets.GetWorksheetByIndex(0);
for (uint r = genericWS.FirstRow; r <= genericWS.LastRow; ++r)
{
    Net.SourceForge.Koogra.IRow row = genericWS.Rows.GetRow(r);
    for (uint c = genericWS.FirstCol; c <= genericWS.LastCol; ++c)
    {
        // raw value
        Console.WriteLine(row.GetCell(c).Value);
        // formatted value
        Console.WriteLine(row.GetCell(c).GetFormattedValue());
    }
}
I hope this helps anyone else who runs into the same out-of-memory issue.
Enjoy.
A small update to the code above:
I have played with this a little. As far as the content of the file is concerned,
the chart is ranked by Unique IP, and the current code is:
// Place the source file in your project's bin\Debug directory;
// the extracted file will be written next to the source file.
var pathtoRead = Path.Combine(Environment.CurrentDirectory, "tst.xlsx");
var pathtoWrite = Path.Combine(Environment.CurrentDirectory, "tst.txt");
Net.SourceForge.Koogra.IWorkbook genericWB = Net.SourceForge.Koogra.WorkbookFactory.GetExcel2007Reader(pathtoRead);
Net.SourceForge.Koogra.IWorksheet genericWS = genericWB.Worksheets.GetWorksheetByIndex(0);
StringBuilder SbXls = new StringBuilder();
for (uint r = genericWS.FirstRow; r <= genericWS.LastRow; ++r)
{
    Net.SourceForge.Koogra.IRow row = genericWS.Rows.GetRow(r);
    string LineEnding = string.Empty;
    for (uint ColCount = genericWS.FirstCol; ColCount <= genericWS.LastCol; ++ColCount)
    {
        var formatted = row.GetCell(ColCount).GetFormattedValue();
        // Column 0 is followed by a tab, column 1 by a line break.
        if (ColCount == 1)
            LineEnding = Environment.NewLine;
        else if (ColCount == 0)
            LineEnding = "\t";
        // Only the first two columns are written to the output.
        if (ColCount <= 1)
            SbXls.Append(string.Concat(formatted, LineEnding));
    }
}
File.WriteAllText(pathtoWrite, SbXls.ToString());
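The original goal was a dictionary keyed on one column with the other column as the value. As a minimal sketch, assuming the tab-separated text produced by the code above ("key<TAB>value" per line), the text could be turned into a dictionary as follows. The class and method names here are hypothetical and not part of Koogra.

```csharp
using System;
using System.Collections.Generic;

public static class TwoColumnParser
{
    // Builds a dictionary from "key<TAB>value" lines.
    // Assumption: when a key appears more than once, the last value wins.
    public static Dictionary<string, string> Parse(string text)
    {
        var result = new Dictionary<string, string>();
        var lines = text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var line in lines)
        {
            var parts = line.Split('\t');
            if (parts.Length < 2)
                continue; // skip lines that do not contain both columns
            result[parts[0]] = parts[1];
        }
        return result;
    }
}
```

It could be called as TwoColumnParser.Parse(SbXls.ToString()). The same assignment, result[key] = value, could also be done directly inside the row loop with the two GetFormattedValue() results, which would avoid the intermediate text.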

Related

Excel shows #VALUE! before "Enable Editing" when a formula is written with EPPlus

Using C# on .NET Core, I am updating an existing Excel template with data and formulas using the EPPlus library, version 4.5.3.3.
As the screenshots show, every formula cell contains '#VALUE!' even after I call the Calculate method in my C# code (for reference, I also attached a screenshot of the XML taken right after downloading the file, before opening it). Automatic calculation is enabled in Excel.
One blog post suggested checking the XML.
My requirement is to upload this Excel file to a SharePoint site through code and to read the formula cells for other operations, without opening the file manually.
Is there any other way to calculate the formula cells from code and update the cell values?
I also went through "Why won't this formula calculate unless i double click a cell?", but had no luck.
using (ExcelPackage p = new ExcelPackage())
{
    MemoryStream stream = new MemoryStream(byteArray);
    p.Load(stream);
    ExcelWorksheet worksheet = p.Workbook.Worksheets.FirstOrDefault(a => a.Name == "InputTemplate");
    if (worksheet != null)
    {
        // Calculate only after the null check; FirstOrDefault may return null.
        worksheet.Calculate();
        worksheet.Cells["A3"].Value = company.CompanyName; // company name
        worksheet.Cells["B3"].Value = product.Name; // product name
        worksheet.Cells["C3"].Value = product.NetWeight;
        worksheet.Cells["D3"].Value = product.ServingSize;
        worksheet.Cells["E3"].Value = 0;
        var produceAndIngredientDetailsForExcelList = await GetProduceAndIngredientDetails(companyId, productId);
        // rowIndex will be 3
        WriteProduceAndIngredientDetailsInExcel(worksheet, produceAndIngredientDetailsForExcelList);
        // rowIndex is updated based on the number of produce rows, then the aggregates
        StageWiseAggregate(worksheet, produceAndIngredientDetailsForExcelList);
        // write the Total Impacts row
        TotalImpactsFormulaSection(worksheet);
        worksheet.Calculate();
    }
    Byte[] bin = p.GetAsByteArray();
    return bin;
}
Formula code:
var columnIndex = 22; // column "V"
for (; columnIndex <= 27; columnIndex++)
{
    var columnName = GetExcelColumnName(columnIndex);
    worksheet.Cells[currentRowIndex, columnIndex].Formula = $"=SUBTOTAL(109,{columnName}{firstRowIndex}:{columnName}{currentRowIndex - 1})";
}
I got the solution to this issue from my architect (kudos to him).
I was writing the formulas the wrong way by blindly following tutorials like
https://riptutorial.com/epplus/example/26433/add-formulas-to-a-cell
Note: don't follow the link shown above.
The formula string must not start with "=". I just removed it, and it worked like a charm:
var columnIndex = 22; // column "V"
for (; columnIndex <= 27; columnIndex++)
{
    var columnName = GetExcelColumnName(columnIndex);
    worksheet.Cells[currentRowIndex, columnIndex].Formula = $"SUBTOTAL(109,{columnName}{firstRowIndex}:{columnName}{currentRowIndex - 1})";
}
Here is the official tutorial, which shows the correct form:
https://www.epplussoftware.com/en/Developers/ (check the second slide)
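The snippets above call GetExcelColumnName, whose body is not shown in the post. A minimal sketch of what such a helper might look like, assuming it maps a 1-based column index to Excel's letter name (22 to "V", 27 to "AA"); the class name is hypothetical and this is not the poster's actual implementation:

```csharp
using System;
using System.Text;

public static class ExcelColumns
{
    // Converts a 1-based column index to its letter name (1 -> "A", 27 -> "AA").
    public static string GetExcelColumnName(int columnIndex)
    {
        if (columnIndex < 1)
            throw new ArgumentOutOfRangeException(nameof(columnIndex));

        var name = new StringBuilder();
        while (columnIndex > 0)
        {
            // Column names are a base-26 system without a zero digit,
            // so shift by one before taking the remainder.
            int remainder = (columnIndex - 1) % 26;
            name.Insert(0, (char)('A' + remainder));
            columnIndex = (columnIndex - 1) / 26;
        }
        return name.ToString();
    }
}
```

With this helper the loop from 22 to 27 would cover columns V, W, X, Y, Z and AA.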

How to work with only part of an Excel file

Hello, and sorry for my English.
I have to select a row of an Excel file, write new data into it and save it.
Afterwards the Excel file is always larger than before, although the amount of data has not increased; it looks as if blank columns were created to the right.
I think this happens when I execute the following statements
var wb = openWorkBook(filename);
var ws = wb.Worksheet("CNF");
IXLRow row = ws.Row(device.Ordinal - 1 + FirstRow);
for (int j = 0; j < MAXCOLS; ++j)
{
    IXLCell cell = row.Cell(j + FirstCol);
    ...}
because the range goes from A1 to XFD1048576.
Even though I take only the row I am interested in and loop over 100 columns, when I call
wb.Save();
the file grows.
So I would like to know whether there is a way to take only part of a file, for example to get a worksheet already limited to a fixed number of columns, starting from the statement var ws = wb.Worksheet("CNF");.
Thank you.

How to maintain style formatting when merging two ODT documents together

I am working with the AODL library for C#. So far I have been able to wholesale import the text of the second document into the first. The issue is I can't quite figure out what I need to grab to make sure the styling is also moved over to the merged document. Below is the simple code I'm using to test. The closest answer I can find is Merging two .odt files from code, which somewhat answers my question, but it still doesn't tell me where I need to put the styling/ where to get it from. It at least lets me know that I need to go through the styles in the second document and make sure there are not matching names in the first otherwise there will be conflicts. I'm not sure exactly what to do, and documentation has been very slim. Before you suggest anything I would like to let you know that, yes, odt is the filetype I need to work with, and doing any kind of interop stuff like Microsoft does with Word is not what I'm after. If there is another library out there that works similarly to AODL I'm all ears.
TextDocument mergeTemplateDoc = ReadContentsOfFile(mergeTemplateFileName);
TextDocument vehicleTemplateDoc = ReadContentsOfFile(vehicleTemplateFileName);
foreach (IContent piece in vehicleTemplateDoc.Content)
{
XmlNode newNode = mergeTemplateDoc.XmlDoc.ImportNode(piece.Node,true);
Paragraph p = ParagraphBuilder.CreateParagraphWithExistingNode(mergeTemplateDoc, newNode);
mergeTemplateDoc.Content.Add(p);
}
mergeTemplateDoc.SaveTo("MergComplete.odt");
Here is what I ended up doing to solve my issue. Keep in mind I have since migrated to using Java since this question was asked, as the library appears to work a little better in that language.
Essentially, the methods below grab the automatic styles that are generated in each document. mergeStylesToPrimaryDoc iterates through the second document and finds each style node, checking for the name attribute. That name is then tagged with an extra identifier that is unique to that document, so the names won't conflict when the documents are merged.
mergeFontTypesToPrimaryDoc just copies the fonts that don't already exist in the primary doc. Since fonts are referenced in the same way in both documents, no renaming is needed.
updateNodeChildrenStyleNames is a recursive method that I used to make sure all the inline style nodes are updated, removing any conflicting names between the two documents.
A similar idea should work in C# as well.
private static void mergeStylesToPrimaryDoc(OdfTextDocument primaryDoc, OdfTextDocument secondaryDoc) throws Exception {
    OdfFileDom primaryContentDom = primaryDoc.getContentDom();
    OdfOfficeAutomaticStyles primaryDocAutomaticStyles = primaryDoc.getContentDom().getAutomaticStyles();
    OdfOfficeAutomaticStyles secondaryDocAutomaticStyles = secondaryDoc.getContentDom().getAutomaticStyles();
    //Adopt style nodes from secondary doc
    for (int i = 0; i < secondaryDocAutomaticStyles.getLength(); i++) {
        Node style = secondaryDocAutomaticStyles.item(i).cloneNode(true);
        if (style.hasAttributes()) {
            NamedNodeMap attributes = style.getAttributes();
            for (int j = 0; j < attributes.getLength(); j++) {
                Node a = attributes.item(j);
                if (a.getLocalName().equals("name")) {
                    a.setNodeValue(a.getNodeValue() + _stringToAddToStyle);
                }
            }
        }
        if (style.hasChildNodes()) {
            updateNodeChildrenStyleNames(style, _stringToAddToStyle, "name");
        }
        primaryDocAutomaticStyles.appendChild(primaryContentDom.adoptNode(style));
    }
}

private static void mergeFontTypesToPrimaryDoc(OdfTextDocument primaryDoc, OdfTextDocument secondaryDoc) throws Exception {
    //Insert referenced font types that are not in the primary document you are merging into
    NodeList sdDomNodes = secondaryDoc.getContentDom().getChildNodes().item(0).getChildNodes();
    NodeList pdDomNodes = primaryDoc.getContentDom().getChildNodes().item(0).getChildNodes();
    OdfFileDom primaryContentDom = primaryDoc.getContentDom();
    Node sdFontNode = null;
    Node pdFontNode = null;
    for (int i = 0; i < sdDomNodes.getLength(); i++) {
        if (sdDomNodes.item(i).getNodeName().equals("office:font-face-decls")) {
            sdFontNode = sdDomNodes.item(i);
            break;
        }
    }
    for (int i = 0; i < pdDomNodes.getLength(); i++) {
        Node n = pdDomNodes.item(i);
        if (n.getNodeName().equals("office:font-face-decls")) {
            pdFontNode = pdDomNodes.item(i);
            break;
        }
    }
    if (sdFontNode != null && pdFontNode != null) {
        NodeList sdFontNodeChildList = sdFontNode.getChildNodes();
        NodeList pdFontNodeChildList = pdFontNode.getChildNodes();
        List<String> fontNames = new ArrayList<String>();
        //Get list of existing fonts in primary doc
        for (int i = 0; i < pdFontNodeChildList.getLength(); i++) {
            NamedNodeMap attributes = pdFontNodeChildList.item(i).getAttributes();
            for (int j = 0; j < attributes.getLength(); j++) {
                if (attributes.item(j).getLocalName().equals("name")) {
                    fontNames.add(attributes.item(j).getNodeValue());
                }
            }
        }
        //Check each font in the secondary doc to make sure it gets added if the primary doesn't have it
        for (int i = 0; i < sdFontNodeChildList.getLength(); i++) {
            Node fontNode = sdFontNodeChildList.item(i).cloneNode(true);
            NamedNodeMap attributes = fontNode.getAttributes();
            String fontName = "";
            for (int j = 0; j < attributes.getLength(); j++) {
                if (attributes.item(j).getLocalName().equals("name")) {
                    fontName = attributes.item(j).getNodeValue();
                    break;
                }
            }
            if (!fontName.equals("") && !fontNames.contains(fontName)) {
                pdFontNode.appendChild(primaryContentDom.adoptNode(fontNode));
            }
        }
    }
}

private static void updateNodeChildrenStyleNames(Node n, String stringToAddToStyle, String nodeLocalName) {
    NodeList childNodes = n.getChildNodes();
    for (int i = 0; i < childNodes.getLength(); i++) {
        Node currentChild = childNodes.item(i);
        if (currentChild.hasAttributes()) {
            NamedNodeMap attributes = currentChild.getAttributes();
            for (int j = 0; j < attributes.getLength(); j++) {
                Node a = attributes.item(j);
                if (a.getLocalName().equals(nodeLocalName)) {
                    a.setNodeValue(a.getNodeValue() + stringToAddToStyle);
                }
            }
        }
        if (currentChild.hasChildNodes()) {
            updateNodeChildrenStyleNames(currentChild, stringToAddToStyle, nodeLocalName);
        }
    }
}
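Since the answer suggests the same idea should carry over to C#, here is a minimal sketch of the renaming step only, written against System.Xml rather than AODL. It assumes you already hold the style node as an XmlNode; the class and method names are hypothetical, and it does not cover copying the renamed styles into the primary document or updating the content that refers to them.

```csharp
using System;
using System.Xml;

public static class StyleRenamer
{
    // Appends a suffix to every attribute whose local name matches,
    // on the given node and, recursively, on all of its descendants.
    public static void AppendSuffix(XmlNode node, string suffix, string attributeLocalName)
    {
        // Attributes is null for non-element nodes such as text nodes.
        if (node.Attributes != null)
        {
            foreach (XmlAttribute attribute in node.Attributes)
            {
                if (attribute.LocalName == attributeLocalName)
                    attribute.Value += suffix;
            }
        }
        foreach (XmlNode child in node.ChildNodes)
        {
            AppendSuffix(child, suffix, attributeLocalName);
        }
    }
}
```

Calling StyleRenamer.AppendSuffix(styleNode, "_doc2", "name") would turn a style named P1 into P1_doc2, which mirrors what the Java methods above do with _stringToAddToStyle.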
I do not know precisely how it should be coded, but using 7-Zip I have been able to simply copy the whole styles.xml from one file to another. Programmatically it should be just as easy.
I always format my files with styles and never with direct formatting, so simply replacing the file is liable to eliminate the local styles.
I found this answer (to the question "Cleaning a stylesheet of unused styles") https://www.mobileread.com/forums/showpost.php?s=cbbee08a1204df71ec5cd88bcf222253&p=2100914&postcount=13
which iterates through all the styles in one document. It doesn't show how to incorporate one document's styles into the other, but the backbone is clear.
'---------------------------------------------------------- 03/02/2012
' Delete the unused custom styles
' of a text document or a spreadsheet
'---------------------------------------------------------------------
sub stylesPersoInutiles()
    dim coStylesFamilles as object, oStyleFamille as object
    dim oStyle as object, nomFamille as string
    dim f as long, x as long
    dim ts(), buf as string, iRet as integer
    const SEP = ", "
    coStylesFamilles = thisComponent.StyleFamilies
    for f = 0 to coStylesFamilles.count -1
        ' For each style family
        nomFamille = coStylesFamilles.elementNames(f)
        oStyleFamille = coStylesFamilles.getByName(nomFamille)
        buf = ""
        for x = 0 to oStyleFamille.Count -1
            ' For each style
            oStyle = oStyleFamille(x)
            'xray oStyle
            if (oStyle.isUserDefined) and (not oStyle.isInUse) then
                buf = buf & oStyle.name & SEP
            end if
        next x
        if len(buf) > len(SEP) then
            buf = left(buf, len(buf) - len(SEP))
            iRet = msgBox("Unused custom styles: " _
                & chr(13) & buf & chr(13) & chr(13) _
                & "Should they be deleted?", 4+32+256, nomFamille)
            if iRet = 6 then
                ts = split(buf, SEP)
                for x = 0 to uBound(ts)
                    oStyleFamille.removeByName(ts(x))
                next x
            end if
        end if
    next f
end sub

How to add items one at a time, each on a new line, to a Word document using Word interop

I am trying to add three types of content (headers, tables and pictures) to a Word document. This is how I am trying to do it now. However, each item replaces the last one, and images are always added at the beginning of the page. I have a loop that calls a function to create the headers and tables, and then adds the images afterwards. I think the problem is the ranges: I use a starting range of object start = 0;
How can I get these items added one at a time, each on a new line in the document?
foreach (var category in observedColumns)
{
    CreateHeadersAndTables();
    createPictures();
}
Adding Headers:
object start = 0;
Word.Range rng = doc.Range(ref start , Missing.Value);
Word.Paragraph heading;
heading = doc.Content.Paragraphs.Add(Missing.Value);
heading.Range.Text = category;
heading.Range.InsertParagraphAfter();
Adding Tables:
Word.Table table;
table = doc.Content.Tables.Add(rng, 1, 5);
Adding Pictures:
doc.Application.Selection.InlineShapes.AddPicture(@path);
A simple approach is to let paragraphs supply the Range objects and simply insert new paragraphs one by one.
Looking at the API documentation reveals that Paragraphs implements an Add method which:
Returns a Paragraph object that represents a new, blank paragraph
added to a document. (...) If Range isn't specified, the new paragraph is added after the selection or range or at the end of the document.
Source: http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word.paragraphs.add(v=office.14).aspx
That makes it straightforward to append new content to the document.
For completeness I have included a sample that shows how a solution might work. The sample runs a for loop, and on each iteration it inserts:
A new line of text
A table
A picture
The sample is implemented as a C# console application using:
.NET 4.5
Microsoft Office Object Library version 15.0, and
Microsoft Word Object Library version 15.0
... that is, the MS Word Interop API that ships with MS Office 2013.
using System;
using System.IO;
using Microsoft.Office.Interop.Word;
using Application = Microsoft.Office.Interop.Word.Application;

namespace StackOverflowWordInterop
{
    class Program
    {
        static void Main()
        {
            // Open Word and a docx file
            var wordApplication = new Application() { Visible = true };
            var document = wordApplication.Documents.Open(@"C:\Users\myUserName\Documents\document.docx", Visible: true);
            // "10" is chosen at random - select a value that fits your purpose
            for (var i = 0; i < 10; i++)
            {
                // Insert text
                var pText = document.Paragraphs.Add();
                pText.Format.SpaceAfter = 10f;
                pText.Range.Text = String.Format("This is line #{0}", i);
                pText.Range.InsertParagraphAfter();
                // Insert table
                var pTable = document.Paragraphs.Add();
                pTable.Format.SpaceAfter = 10f;
                var table = document.Tables.Add(pTable.Range, 2, 3, WdDefaultTableBehavior.wdWord9TableBehavior);
                for (var r = 1; r <= table.Rows.Count; r++)
                    for (var c = 1; c <= table.Columns.Count; c++)
                        table.Cell(r, c).Range.Text = String.Format("This is cell {0} in table #{1}", String.Format("({0},{1})", r, c), i);
                // Insert picture
                var pPicture = document.Paragraphs.Add();
                pPicture.Format.SpaceAfter = 10f;
                document.InlineShapes.AddPicture(Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments), "img_1.png"), Range: pPicture.Range);
            }
            // Some console ASCII UI
            Console.WriteLine("Press Enter to save the document and close Word..");
            Console.ReadLine();
            // Save the document
            document.Save();
            // Close Word
            wordApplication.Quit();
        }
    }
}

Export tables from PDF to Excel?

How do I export only the table contents to an Excel file through C# programming?
I am currently extracting all the contents from PDFs using the PDFNet SDK, but I am not able to read a table as a tabular structure.
I have not used the SDK for this product, but I have used the stand-alone product. It reads the content of a PDF into a spreadsheet (with many export options).
The product is OmniPage by Nuance http://australia.nuance.com/for-business/by-product/omnipage/index.htm.
There is an SDK with a free evaluation.
Using the ByteScout PDF Extractor SDK we are able to extract the whole page, as below:
CSVExtractor extractor = new CSVExtractor();
extractor.RegistrationName = "demo";
extractor.RegistrationKey = "demo";
TableDetector tdetector = new TableDetector();
tdetector.RegistrationKey = "demo";
tdetector.RegistrationName = "demo";
// Load the document
extractor.LoadDocumentFromFile("C:\\sample.pdf");
tdetector.LoadDocumentFromFile("C:\\sample.pdf");
int pageCount = tdetector.GetPageCount();
for (int i = 1; i <= pageCount; i++)
{
    int j = 1;
    do
    {
        extractor.SetExtractionArea(tdetector.GetPageRect_Left(i),
            tdetector.GetPageRect_Top(i),
            tdetector.GetPageRect_Width(i),
            tdetector.GetPageRect_Height(i)
        );
        // and finally save the table into CSV file
        extractor.SavePageCSVToFile(i, "C:\\page-" + i + "-table-" + j + ".csv");
        j++;
    } while (tdetector.FindNextTable()); // search next table
}
Since this is an old post, I hope it helps others.
The above answer (John's) works and is really useful.
However, I use the ByteScout PDF Extractor SDK tools instead of code.
By the way, the tool generates a lot of sheets in one Excel file.
You can use the code below in Excel to combine them into one sheet.
Sub ConvertAsOne()
    Application.ScreenUpdating = False
    For j = 1 To Sheets.Count
        If Sheets(j).Name <> ActiveSheet.Name Then
            X = Range("A65536").End(xlUp).Row + 1
            Sheets(j).UsedRange.Copy Cells(X, 1)
        End If
    Next
    Range("B1").Select
    Application.ScreenUpdating = True
    MsgBox "succeed!", vbInformation, "note"
End Sub
