Split a large Excel file into multiple Excel files using an Azure Function (C#)

I'm able to split the Excel file in a function locally, but after publishing it to Azure Functions it throws a timeout exception. What should I do? How can Azure Durable Functions help here?
This is how I'm doing it:
bookOriginal.LoadFromStream(BlobService.GetFileFromBlob(filename));
log.LogInformation("File read from Azure Blob");
Worksheet sheet = bookOriginal.Worksheets[0];
var totalRow = sheet.Rows.Count();
int splitRows = 7000;
int count = totalRow / splitRows;
for (int i = 1; i <= count; i++)
{
    CellRange range1;
    Workbook newBook1 = new Workbook();
    newBook1.CreateEmptySheets(1);
    Worksheet newSheet1 = newBook1.Worksheets[0];
    Model localModel = new Model();
    if (i == 1)
    {
        range1 = sheet.Range[2, 1, splitRows, sheet.LastColumn];
    }
    else
    {
        range1 = sheet.Range[(splitRows * (i - 1)) + 1, 1, splitRows * i, sheet.LastColumn];
    }
    newSheet1.Copy(range1, newSheet1.Range[1, 1]);
    //bookOriginal.SaveToFile("Research and Development.xlsx", ExcelVersion.Version2007);
    localModel.workbookObject = newBook1;
    model.Add(localModel);
}
Console.WriteLine("Ran Completely");

Yes, Durable Functions can help here.
Take a look at the overview: https://learn.microsoft.com/it-it/azure/azure-functions/durable/durable-functions-overview?tabs=csharp
The first and second patterns described there (function chaining and fan-out/fan-in) are the relevant ones. The project structure could be:
A blob-triggered function that downloads the source Excel file, converts it into an object, and passes that object as input when it starts the orchestrator.
The orchestrator function deserializes the input object and groups the rows as you did in your code.
Inside a foreach loop, the orchestrator passes each group of rows as the parameter of an activity call. You can run the activities in sequence (pattern 1, awaiting each activity) or in parallel (pattern 2, using Task.WhenAll).
The activity function converts the row group into an Excel file and uploads it to storage through the Blob attribute used as an output binding.
WARNING: the Durable Functions documentation says: "Return values are serialized to JSON and persisted to the orchestration history table in Azure Table storage."
So the input model must be serializable as JSON.
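The row grouping in the orchestrator step can be sketched in plain C# as below. This is only an illustration under assumptions: RowChunker, Split, and firstDataRow are hypothetical names that do not come from the question or from any Azure library, and the sketch assumes row 1 is a header row. It produces inclusive row ranges and also covers the final partial block, which a loop bounded by totalRow / splitRows would skip.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Spire.XLS or Durable Functions).
public static class RowChunker
{
    // Returns inclusive, 1-based (FirstRow, LastRow) pairs that cover rows
    // firstDataRow..totalRows in blocks of at most chunkSize rows.
    public static List<(int FirstRow, int LastRow)> Split(
        int totalRows, int chunkSize, int firstDataRow = 2)
    {
        if (chunkSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(chunkSize));

        var ranges = new List<(int FirstRow, int LastRow)>();
        for (int first = firstDataRow; first <= totalRows; first += chunkSize)
        {
            // Clamp the last block so it never runs past the end of the sheet.
            int last = Math.Min(first + chunkSize - 1, totalRows);
            ranges.Add((first, last));
        }
        return ranges;
    }
}

public static class Program
{
    public static void Main()
    {
        // 15001 rows in total: one header row plus 15000 data rows.
        foreach (var (first, last) in RowChunker.Split(totalRows: 15001, chunkSize: 7000))
            Console.WriteLine($"rows {first}-{last}");
    }
}
```

Each range is just two integers, so it serializes to JSON trivially. One possible design is to pass the blob name plus a range to each activity instead of the row data itself, which keeps the orchestration history small.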

Related

In C# how do I go through a Google Sheets document and write into a specific cell

Following a tutorial, I have set up everything that needs to be set up for Google Sheets API v4. In my Google Sheets document, I have the names of students in the first column, and in the second column I want to put their GPA. In my code, I have two variables that the user inputs, string name and string gpa. I want to go through column A, look for that name, and insert that GPA next to it. I know I should probably use a for loop to go through the column and compare every cell with the string the user typed, but nothing I have tried so far has worked.
I wrote a simple method that can get entries; for now it only prints, but that can easily be changed:
static void ReadEntries()
{
    var range = $"{sheet}!A1:F10";
    var request = service.Spreadsheets.Values.Get(SpreadsheetId, range);
    var response = request.Execute();
    var values = response.Values;
    if (values != null && values.Count > 0)
    {
        foreach (var row in values)
        {
            Console.WriteLine("{0} | {1}", row[0], row[1]);
        }
    }
    else
    {
        Console.WriteLine("No data found");
    }
}
and a method that can update a specific cell:
static void UpdateEntry()
{
    var range = $"{sheet}!B2"; //example
    var valueRange = new ValueRange();
    var objectList = new List<object>() { "updated" };
    valueRange.Values = new List<List<object>> { objectList };
    var updateRequest = service.Spreadsheets.Values.Update(valueRange, SpreadsheetId, range);
    updateRequest.ValueInputOption = SpreadsheetsResource.ValuesResource.AppendRequest.ValueInputOptionEnum.USERENTERED;
    var updateResponse = updateRequest.Execute();
}
EDIT: I need help with making a for loop to go through my A column and find the student with the same name. I know how to update a cell. I just don't know how to find a cell that needs updating.

Sounds like you are very close. You already have the value you are searching for in row[0] inside the loop, so all you need to do is track the row number as you iterate.
if (values != null && values.Count > 0)
{
    int rowNo = 0;
    foreach (var row in values)
    {
        rowNo++;
        Console.WriteLine("{0} | {1}", row[0], row[1]);
        if (row[0].ToString() == "John")
        {
            string rangeToUpdate = $"{sheet}!B{rowNo}:B{rowNo}";
            ...
        }
    }
}
You could also change from using a foreach to a standard for loop.
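The for-loop variant mentioned above might look like the sketch below. It runs on an in-memory list shaped like response.Values, so it needs no Google client library. SheetSearch, FindRow, and firstSheetRow are hypothetical names introduced for illustration, and the case-insensitive comparison is an assumption about what the asker wants. It also guards against empty rows, since the API omits trailing empty cells and row[0] may not exist.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; operates on data shaped like response.Values.
public static class SheetSearch
{
    // Returns the 1-based sheet row whose first cell equals name, or -1 if
    // there is no match. firstSheetRow is the sheet row that values[0]
    // corresponds to (1 when the requested range starts at A1).
    public static int FindRow(IList<IList<object>> values, string name, int firstSheetRow = 1)
    {
        if (values == null) return -1;

        for (int i = 0; i < values.Count; i++)
        {
            IList<object> row = values[i];
            // Skip rows that came back without any cells.
            if (row.Count > 0 &&
                string.Equals(row[0]?.ToString(), name, StringComparison.OrdinalIgnoreCase))
            {
                return firstSheetRow + i;
            }
        }
        return -1;
    }
}

public static class Program
{
    public static void Main()
    {
        var values = new List<IList<object>>
        {
            new List<object> { "Alice", "3.9" },
            new List<object>(),                 // an empty row
            new List<object> { "John" },        // GPA cell still empty
        };

        int rowNo = SheetSearch.FindRow(values, "john");
        if (rowNo > 0)
            Console.WriteLine($"Sheet1!B{rowNo}"); // range to pass to the update call
    }
}
```

The returned row number plugs straight into the range string used by the update method, for example $"{sheet}!B{rowNo}".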

I'm not experienced in the .NET client library of the Sheets API.
However, having used the Sheets API with the node and python client libraries, I can point you to the documentation you should follow. This is the official API documentation, with code examples for each language having a Google-provided client library.
For example, here is the spreadsheets.values.update documentation that you use, with a code example for C#.
On to the question then:
According to the JSON representation of a ValueRange, ValueRange.Range does not seem to be optional, even though it is redundant. You might need to add ValueRange.Range = range; to your code.
Also, you are using SpreadsheetsResource.ValuesResource.AppendRequest instead of SpreadsheetsResource.ValuesResource.UpdateRequest when setting your ValueInputOption.
Let me know if it helped!
Update
This also seems to be a duplicate of Update a cell with C# and Sheets API V4

Google Sheets API v4: Delete row in C#

I'm trying to delete a single row with the Google Sheets API in C#, and I can't get it to work. Fetching and updating rows work as intended.
Here's my code (inspired from C# Google Sheets API - Delete Row and the Java documentation):
var request = new Request
{
    DeleteDimension = new DeleteDimensionRequest
    {
        Range = new DimensionRange
        {
            SheetId = 0,
            Dimension = "ROWS",
            StartIndex = 21,
            EndIndex = 21
        }
    }
};
var deleteRequest = new BatchUpdateSpreadsheetRequest { Requests = new List<Request> { request } };
// First way: create a batch update request from scratch then execute it
var responseFirstWay = new SpreadsheetsResource.BatchUpdateRequest(MY_SHEETS_SERVICE, deleteRequest, MY_SPREADSHEET_ID).Execute();
// Second way: create a batch update request from the existing SheetsService then execute it
var responseSecondWay = MY_SHEETS_SERVICE.Spreadsheets.BatchUpdate(deleteRequest, MY_SPREADSHEET_ID).Execute();
I've tried various indexes, but it doesn't seem to change anything (in my example above I've put 21, with all rows up to 30 filled with data). I find it a bit weird to have SheetId set to zero, but that's the gid parameter I see when visiting Google Sheets.
No matter how I create my BatchUpdateRequest, the response is empty after I execute it.

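According to the documentation for DimensionRange, the indexes are half-open: the start index is inclusive and the end index is exclusive. With StartIndex and EndIndex both set to 21, the range is empty, so nothing is deleted.
You should change your EndIndex to 22 to delete the row at index 21. The indexes are zero-based, so that is row 22 as displayed in the sheet; to delete displayed row 21, use StartIndex = 20 and EndIndex = 21.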
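The conversion from a displayed row number to the half-open, zero-based range can be written down as a tiny helper. This is a sketch only: RowIndex and ForSheetRow are hypothetical names, not part of the Sheets client library.

```csharp
using System;

// Hypothetical helper; not part of Google.Apis.Sheets.v4.
public static class RowIndex
{
    // Maps a 1-based row number, as displayed in the sheet, to the zero-based
    // half-open [StartIndex, EndIndex) pair that selects exactly that row.
    public static (int StartIndex, int EndIndex) ForSheetRow(int sheetRow)
    {
        if (sheetRow < 1)
            throw new ArgumentOutOfRangeException(nameof(sheetRow));
        return (sheetRow - 1, sheetRow);
    }
}

public static class Program
{
    public static void Main()
    {
        var (start, end) = RowIndex.ForSheetRow(22);
        Console.WriteLine($"StartIndex = {start}, EndIndex = {end}");
    }
}
```

The two values can then be assigned to StartIndex and EndIndex of the DimensionRange in the delete request.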

AMO get partitions where data are processed but not indexes

I am writing a script that returns all unprocessed partitions within a measure group using the following command:
objMeasureGroup.Partitions.Cast<Partition>().Where(x => x.State != AnalysisState.Processed)
After some experiments, it looks like this property only indicates whether the data is processed and says nothing about the indexes.
After searching for hours, I didn't find any method to list the partitions where the data is processed but the indexes are not.
Any suggestions?
Environment:
SQL Server 2014
SSAS multidimensional cube
Scripts are written within an SSIS package (Script Task)

First, ProcessIndexes is an incremental operation. So if you run it twice the second time will be pretty quick because there is nothing to do. So I would recommend just running it on the cube and not worrying about whether it was previously run. However if you do need to analyze the current state then read on.
The best way (only way I know of) to distinguish whether ProcessIndexes has been run on a partition is to study the DISCOVER_PARTITION_STAT and DISCOVER_PARTITION_DIMENSION_STAT DMVs as seen below.
The DISCOVER_PARTITION_STAT DMV returns one row per aggregation with the rowcount. The first row of that DMV has a blank aggregation name and represents the rowcount of the lowest level data processed in that partition.
The DISCOVER_PARTITION_DIMENSION_STAT DMV can tell you about whether indexes are processed and which range of values by each dimension attribute are in this partition (by internal IDs, so not super easy to interpret). We assume at least one dimension attribute is set to be optimized so it will be indexed.
You will also need to add a reference to Microsoft.AnalysisServices.AdomdClient, which simplifies running these DMVs:
string sDatabaseName = "YourDatabaseName";
string sCubeName = "YourCubeName";
string sMeasureGroupName = "YourMeasureGroupName";
Microsoft.AnalysisServices.Server s = new Microsoft.AnalysisServices.Server();
s.Connect("Data Source=localhost");
Microsoft.AnalysisServices.Database db = s.Databases.GetByName(sDatabaseName);
Microsoft.AnalysisServices.Cube c = db.Cubes.GetByName(sCubeName);
Microsoft.AnalysisServices.MeasureGroup mg = c.MeasureGroups.GetByName(sMeasureGroupName);
Microsoft.AnalysisServices.AdomdClient.AdomdConnection conn = new Microsoft.AnalysisServices.AdomdClient.AdomdConnection(s.ConnectionString);
conn.Open();
foreach (Microsoft.AnalysisServices.Partition p in mg.Partitions)
{
    Console.Write(p.Name + " - " + p.State + " - ");
    var restrictions = new Microsoft.AnalysisServices.AdomdClient.AdomdRestrictionCollection();
    restrictions.Add("DATABASE_NAME", db.Name);
    restrictions.Add("CUBE_NAME", c.Name);
    restrictions.Add("MEASURE_GROUP_NAME", mg.Name);
    restrictions.Add("PARTITION_NAME", p.Name);
    var dsAggs = conn.GetSchemaDataSet("DISCOVER_PARTITION_STAT", restrictions);
    var dsIndexes = conn.GetSchemaDataSet("DISCOVER_PARTITION_DIMENSION_STAT", restrictions);
    if (dsAggs.Tables[0].Rows.Count == 0)
        Console.WriteLine("ProcessData not run yet");
    else if (dsAggs.Tables[0].Rows.Count > 1)
        Console.WriteLine("aggs processed");
    else if (p.AggregationDesign == null || p.AggregationDesign.Aggregations.Count == 0)
    {
        bool bIndexesBuilt = false;
        foreach (System.Data.DataRow row in dsIndexes.Tables[0].Rows)
        {
            if (Convert.ToBoolean(row["ATTRIBUTE_INDEXED"]))
            {
                bIndexesBuilt = true;
                break;
            }
        }
        if (bIndexesBuilt)
            Console.WriteLine("indexes have been processed. no aggs defined");
        else
            Console.WriteLine("no aggs defined. need to run ProcessIndexes on this partition to build indexes");
    }
    else
        Console.WriteLine("need to run ProcessIndexes on this partition to process aggs and indexes");
}

I am posting this answer as additional information to @GregGalloway's excellent answer.
After searching for a while, the only way I found to know whether a partition's indexes and aggregations are processed is to use DISCOVER_PARTITION_STAT and DISCOVER_PARTITION_DIMENSION_STAT.
I found an article by Darren Gosbell describing the whole process:
SSAS: Are my Aggregations processed?
In the article above, the author provides two methods:
using XMLA
One way in which you can find it out is with an XMLA discover call to the DISCOVER_PARTITION_STAT rowset, but that returns the results in a big lump of XML which is not as easy to read as a tabular result set.
example
<Discover xmlns="urn:schemas-microsoft-com:xml-analysis">
    <RequestType>DISCOVER_PARTITION_STAT</RequestType>
    <Restrictions>
        <RestrictionList>
            <DATABASE_NAME>Adventure Works DW</DATABASE_NAME>
            <CUBE_NAME>Adventure Works</CUBE_NAME>
            <MEASURE_GROUP_NAME>Internet Sales</MEASURE_GROUP_NAME>
            <PARTITION_NAME>Internet_Sales_2003</PARTITION_NAME>
        </RestrictionList>
    </Restrictions>
    <Properties>
        <PropertyList>
        </PropertyList>
    </Properties>
</Discover>
using DMV queries
If you have SSAS 2008, you can use the new DMV feature to query this same rowset and return a tabular result.
example
SELECT *
FROM SystemRestrictSchema($system.discover_partition_stat
    ,DATABASE_NAME = 'Adventure Works DW 2008'
    ,CUBE_NAME = 'Adventure Works'
    ,MEASURE_GROUP_NAME = 'Internet Sales'
    ,PARTITION_NAME = 'Internet_Sales_2003')
Similar posts:
How to find out using AMO if aggregation exists on partition?
Detect aggregation processing state with AMO?

What is the analogue to Excel Interop's Worksheet.UsedRange.Rows in Spreadsheet Light?

Using Excel Interop, you can get the count of rows in use by a sheet like so:
_xlSheet.UsedRange.Rows.Count
(where "_xlSheet" is an Excel.Worksheet).
What is the equivalent in Spreadsheet Light?
You can add a worksheet like so:
var sl = new SLDocument();
. . .
sl.AddWorksheet("SheetsToTheWind");
...but how can you then access that worksheet to interrogate it for its used row count?

After adding the worksheet it is active as well. That means that you can get the WorksheetStatistics from the method GetWorksheetStatistics. That statistics instance has a NumberOfRows property:
// NOTE: The information is only current at point of retrieval.
var stats = sl.GetWorksheetStatistics();
var rowcount = stats.NumberOfRows;
If you want to know the row count of all sheets, you can do:
foreach (var name in sl.GetSheetNames())
{
    sl.SelectWorksheet(name);
    var stats = sl.GetWorksheetStatistics();
    var rowcount = stats.NumberOfRows;
    Trace.WriteLine(String.Format("sheet '{0}' has {1} rows", name, rowcount));
}

Adding to rene's answer:
Since the statistics' NumberOfRows property does not automagically update (you must call GetWorksheetStatistics() each time to get the up-to-date stats), I found it handy to write this helper method:
private int GetCurrentNumberOfRows()
{
    // This reference to "sl" assumes that you have declared "SLDocument sl;" and
    // instantiated it ("sl = new SLDocument();"), perhaps in your class' constructor
    var stats = sl.GetWorksheetStatistics();
    return stats.NumberOfRows;
}
...and then call it as needed:
int lastRow = GetCurrentNumberOfRows();

How to find the data source of a Pivot Table using OpenXML

I am using EPPlus to open and edit an existing Excel document.
The document contains 2 sheets - one with a pivot table (named Pivot) and one with the data (Data!$A$1:$L$9899).
I have a reference to the ExcelPivotTable with the code below, but can't find any properties that relate to the data source.
ExcelPackage package = new ExcelPackage(pivotSpreadsheet);
foreach (ExcelWorksheet worksheet in package.Workbook.Worksheets)
{
    if (worksheet.PivotTables.Count > 0)
    {
        pivotWorkSheetName = worksheet.Name;
        pivotTable = worksheet.PivotTables[0];
    }
}
How do I get the name and range of the source data? Is there an obvious property that I'm missing or do I have to go hunting through some xml?

PivotTables use a data cache for the data store for performance & abstraction reasons. Remember, you can have a pivot that points to a web service call. The cache itself is what stores that reference. For pivots that refer to data elsewhere in a workbook, you can access it in EPPlus like this:
worksheet.PivotTables[0].CacheDefinition.SourceRange.FullAddress;

If anyone is interested in updating the data source with OpenXML SDK 2.5, here is the code I used.
// "ds" is not defined in this snippet; it is presumably a DataSet whose first table holds the new source data.
using (var spreadsheet = SpreadsheetDocument.Open(filepath, true))
{
    PivotTableCacheDefinitionPart ptp = spreadsheet.WorkbookPart.PivotTableCacheDefinitionParts.First();
    ptp.PivotCacheDefinition.RefreshOnLoad = true; // refresh the pivot table on document load
    ptp.PivotCacheDefinition.RecordCount = Convert.ToUInt32(ds.Tables[0].Rows.Count);
    // Cell range used as the data source (one header row plus the data rows)
    ptp.PivotCacheDefinition.CacheSource.WorksheetSource.Reference = "A1:" + IntToLetters(ds.Tables[0].Columns.Count) + (ds.Tables[0].Rows.Count + 1);
    ptp.PivotTableCacheRecordsPart.PivotCacheRecords.RemoveAllChildren(); // it is rebuilt when the pivot table is refreshed
    ptp.PivotTableCacheRecordsPart.PivotCacheRecords.Count = 0; // it is rebuilt when the pivot table is refreshed
}
public string IntToLetters(int value) // copied from another Stack Overflow post
{
    string result = string.Empty;
    while (--value >= 0)
    {
        result = (char)('A' + value % 26) + result;
        value /= 26;
    }
    return result;
}
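To make the reference string concrete, the sketch below reuses the IntToLetters logic from the answer and wraps the concatenation in a helper. PivotReference and BuildReference are hypothetical names added for illustration, and the sketch assumes, as the answer's code appears to, that the sheet has one header row above the data rows.

```csharp
using System;

// Hypothetical wrapper; only IntToLetters comes from the answer above.
public static class PivotReference
{
    // Converts a 1-based column number to its Excel column letters (1 -> A, 27 -> AA).
    public static string IntToLetters(int value)
    {
        string result = string.Empty;
        while (--value >= 0)
        {
            result = (char)('A' + value % 26) + result;
            value /= 26;
        }
        return result;
    }

    // Builds an A1-style range covering one header row plus dataRowCount rows.
    public static string BuildReference(int columnCount, int dataRowCount)
    {
        return "A1:" + IntToLetters(columnCount) + (dataRowCount + 1);
    }
}

public static class Program
{
    public static void Main()
    {
        // 12 columns and 9898 data rows, matching the Data!$A$1:$L$9899 range in the question.
        Console.WriteLine(PivotReference.BuildReference(12, 9898)); // prints "A1:L9899"
    }
}
```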
