Converting consecutive and repeated questions in an excel column into a table - c#

I have an Excel spreadsheet with repeated questions down a single column. It is over 3000 rows deep. The data is confidential so I can only provide the example image below:
I wish to take this data and group the answers into a table so it is suitable for export into an XML file. Example Result:
I have some educational experience with C# and stream read/write, but I'm sure there would be a VBA macro that could perform this much quicker. I have discovered the data is inconsistent because unanswered questions do not have a blank cell beneath them. This happened when I converted the data from headings and text in Word.
This is my first post; some pointers on how to write a better question would be appreciated if necessary.

I can think of a rather manual solution that avoids a VBA macro. Carry out the following steps; once you're confident in the outcome (and in the logic for identifying outliers), record a macro so you can repeat them later.
1. Use column B to mark each "start", probably by finding the string "Reference Code". For B1 that would be: =IF(TRIM(A1)="Reference Code","#","") Fill this to the end of column B, and add a "#" in the next row of column B (just past the end of the data) for good measure.
2. In column C, concatenate the values while B is blank, and reset when B has "#". C2 would be: =IF(B2="#",A2,C1&"§"&A2) (C1 can simply be =A1.) Fill this to the end of column C. Column C now keeps concatenating values until it reaches a new "Reference Code" identified by "#".
3. In column D, mark the last line of every set; this is the only line with complete data, and all the lines before it are incomplete. D1 would be: =IF(B2="#","*","") Fill this to the end of column D.
4. Copy columns C and D and Paste Values. This locks in the concatenated strings and the end markers.
5. Sort by column D and delete any rows without the end marker "*". Then delete column D altogether.
6. Using "Data" > "Text to Columns", split column C up. Use the "Delimited" setting, select "Other" and use "§" as the delimiter (it was used in Step 2).
Up to here everything can be automated, and you might like to record a macro of all the previous steps so you can run it again if you need to.
Columns C to L now contain the relevant data, alternating between the keys ("Reference Code", "Question 1", etc.) and the value data in between. When you sort by each field name column (C, E, G, I, K), you should expect to find the same keyword all the way down; working from left to right, any column that does not consistently show the correct key has misplaced data. Shift the data in the affected rows to the right until it fits into place. This effectively accounts for the missing values.
Delete the field name columns, and you've got yourself a full set of data with empty values reflected.
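For comparison, the same grouping can be sketched in a few lines of code. This is only an illustration in Python (the logic ports to C# or VBA), and it assumes things the question does not confirm: that the full list of question labels is known in advance, that every record starts with "Reference Code", and that each answer occupies a single cell. The function name and the sample labels are hypothetical.

```python
def group_records(cells, keys, start_key="Reference Code"):
    """Group a flat column of alternating labels and answers into records.

    Assumption: `keys` lists every label that can appear in the column, and a
    label that is followed directly by another label is an unanswered question.
    """
    key_set = set(keys)
    records = []
    current = None   # record currently being filled
    pending = None   # label still waiting for its answer
    for raw in cells:
        text = raw.strip()
        if text in key_set:
            if text == start_key:
                # A new record starts; pre-fill every field with an empty value.
                current = {key: "" for key in keys}
                records.append(current)
            pending = text if current is not None else None
        elif pending is not None and text:
            # First non-blank cell after a label is taken as its answer.
            current[pending] = text
            pending = None
    return records


column = [
    "Reference Code", "R1", "Question 1", "Yes", "Question 2", "Question 3", "No",
    "Reference Code", "R2", "Question 1", "Question 2", "Maybe", "Question 3", "x",
]
labels = ["Reference Code", "Question 1", "Question 2", "Question 3"]
rows = group_records(column, labels)
```

Each dictionary in `rows` is one table row, with unanswered questions left as empty strings, which is the shape needed before writing the XML export.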

Related

Using a classification algorithm for splitting a full name into first and last names?

I have a customers table with 2 columns for the name (firstname, lastname) and it contains around 100k records.
I have a scenario where I have to import new customers but their names come as a single column. Most names are simple (first and last), but some names are double names (with a space or hyphen), double surnames (with a space or hyphen) or even both.
Does using a ML.NET classification algorithm make sense to split the fullname based on a trained model from the 100k records?
I think it would be unnecessary to use machine learning for a problem like this. Try a rule-based method instead.
Assuming the data comes in one column:
Example 1: After splitting the text on spaces, is the word count equal to 2? If so, the first word is the first name and the second word is the surname.
Example 2: Does the text contain a hyphen? If yes, what should happen? How do you decide which part is the first name and which is the surname?
1-) Create training, validation and test sets for yourself.
2-) Write code for the rules you extract from the data in the training set. (You need to make clever deductions here by examining the data.)
3-) Use the validation data to determine the best set of rules.
Finally, evaluate your work by getting results on the test set with the rules you found to be best.
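A minimal sketch of what such rules could look like, in Python for brevity. The two-word rule comes from the example above; the rule for longer names is purely an assumption made for illustration: it treats leading words as part of the first name while they appear in a set of known first names, which could be built from the existing 100k records. The function name and the sample set are hypothetical, and hyphenated words are simply kept whole.

```python
def split_full_name(full_name, known_first_names):
    """Split a full name into (first_name, last_name) with simple rules.

    Assumption: `known_first_names` is a set of lower-cased first names
    collected from the existing customers table.
    """
    tokens = full_name.split()
    if not tokens:
        return "", ""
    if len(tokens) == 1:
        return tokens[0], ""
    if len(tokens) == 2:
        # Simple case: first word is the first name, second is the surname.
        return tokens[0], tokens[1]
    # Longer names: extend the first name while the next word is a known
    # first name, always leaving at least one word for the surname.
    cut = 1
    while cut < len(tokens) - 1 and tokens[cut].lower() in known_first_names:
        cut += 1
    return " ".join(tokens[:cut]), " ".join(tokens[cut:])


first_names = {"anna", "maria", "john"}
print(split_full_name("Anna Maria Lopez Garcia", first_names))
print(split_full_name("Jean-Luc Picard", first_names))
```

Rules like this are what you would tune on the training set and then score on the validation and test sets, by counting how many existing (firstname, lastname) pairs are reproduced from their joined form.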

ListObject.Resize() makes DataBodyRange null when row count == 1

I am trying to reset the number of columns in an Excel ListObject. I know you can add and remove columns one-by-one, but I want to avoid unnecessary loops. I instead decided to resize the ListObject using the Resize method.
Here is the code that I am using (where OutputCasesTable is the ListObject):
OutputCasesTable.DataBodyRange.Value2 = "";
OutputCasesTable.Resize(OutputCasesTable.Range.Resize[ColumnSize: CaseCount]);
OutputCasesTable.DataBodyRange.Value2 = OutputCasesAray;
The above lines of code appear to work perfectly. However, if the ListObject only contains 1 row of data, the DataBodyRange of the ListObject becomes null on the second line, producing an error when I try to set its cells' values. The row in Excel still appears to be present.
The MSDN documentation says the following:
"The header must remain in the same row and the resulting list must overlap the original list. The list must contain a header row and at least one row of data."
Now I understand that "one row of data" implies that the row contains values - so the cause of the error here must be that the DataBodyRange cells all contain no value (""). However, a table with two data rows containing "" still doesn't have a row with data, does it?
I know there are many ways of accomplishing this task, but I want to understand why this happens.
Temporary Solution:
Replaced the code to only set the values to empty strings in columns that will be removed (columns above the new column count). All other columns will be replaced:
if (OutputCasesTable.ListColumns.Count - CaseCount > 0)
    OutputCasesTable.DataBodyRange.Offset[ColumnOffset: CaseCount].Resize[ColumnSize: OutputCasesTable.ListColumns.Count - CaseCount].Value2 = "";
OutputCasesTable.Resize(OutputCasesTable.Range.Resize[ColumnSize: CaseCount]);
OutputCasesTable.DataBodyRange.Value2 = OutputCasesAray;
Personally I prefer looking at the first solution!
Is there anything I can do to make it work with empty strings? Or do you have a better solution?
Best regards,
The Resize operation is the piece that kills the DataBodyRange, and clearly there's some internal logic that Resize uses, along the lines of "if there is only one row, and all the cells are empty, remove all the data rows. If there is more than one row, don't remove any".
I agree that this logic is a bit confounding. If your question is why did Microsoft implement it this way, I'd argue that although it's inconsistent, it's perhaps tidier in a way - it appears to the model that you're working with an empty table, and there's no way for the model to tell the difference graphically (it's not possible for a table to just have a header row).
When Resize turns up to do its work and finds a single-row blank table, it can't tell whether you have a zero-row table or a single-row table with empty strings. If it arrives and finds two empty rows, that's unambiguous (they must be meaningful rows).
For the workaround portion of your question, I'd suggest a tidier solution of just checking the ListRows.Count property, and adding one if necessary. Note that you can also use Clear instead of setting Value2 to blank; for me it reads as more self-explanatory.
OutputCasesTable.DataBodyRange.Clear();
OutputCasesTable.Resize(OutputCasesTable.Range.Resize[ColumnSize: CaseCount]);
if (OutputCasesTable.ListRows.Count == 0) OutputCasesTable.ListRows.Add();
OutputCasesTable.DataBodyRange.Value2 = OutputCasesAray;
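To make the inferred rule and the workaround concrete, here is a toy model in Python. It is not Excel's actual implementation; it only encodes the behaviour guessed at above (a single all-blank data row is dropped on resize, while two or more rows are kept) together with the "add a row if none are left" guard. All function names are hypothetical.

```python
def is_blank_row(row):
    """A row counts as blank when every cell is None or an empty string."""
    return all(cell is None or cell == "" for cell in row)


def rows_after_resize(data_rows):
    """Toy model of the inferred rule: one blank row is treated as no rows."""
    if len(data_rows) == 1 and is_blank_row(data_rows[0]):
        return []
    return list(data_rows)


def ensure_data_row(data_rows, width):
    """Mirror of the ListRows.Count check: add one empty row if none remain."""
    if not data_rows:
        return [[""] * width]
    return data_rows


single = rows_after_resize([["", ""]])           # the data body disappears
double = rows_after_resize([["", ""], ["", ""]])  # both rows survive
repaired = ensure_data_row(single, 3)
```

In this model `single` ends up empty, which corresponds to DataBodyRange being null, and `repaired` has one row again, which is what the ListRows.Add() call achieves.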

Create a non editable attribute in Excel cells that is still readable for a c# program

I have a very particular problem. I looked for problems similar to mine and tested a lot of the solutions people proposed, but none of them is what I need.
My client needs to export a data sheet in Excel format. The data can be sorted, modified or rearranged, new values can be entered, some rows may disappear and others may take their places; in short, anything can happen to the data. For example purposes, let's say that we export a list of items from a grocery list.
ItemID  ItemName  Price
Fr01    Apple     2.5
Fr02    Orange    4.0
Mt01    Beef      10.0
Mt02    Pork      8.33
Vg01    Carrot    1.25
My problem is that this data can be imported back into the software that originally created the Excel file, to update (or add) these values in the database based on the "ItemID". I already validate that the data is "correct" in value, type and interrelationships.
I tried giving each range a name. The problem is that when the data is filtered or sorted, the name doesn't follow the content; it stays at the same position.
Original (Range Name is the name of the range, not an actual column):
ItemID  ItemName  Price  ||  Range Name
Fr01    Apple     2.5    ||  data_fr01
Fr02    Orange    4.0    ||  data_fr02
Mt01    Beef      10.0   ||  data_mt01
Mt02    Pork      8.33   ||  data_mt02
Vg01    Carrot    1.25   ||  data_vg01
After sorting on ItemName:
ItemID  ItemName  Price  ||  Range Name
Fr01    Apple     2.5    ||  data_fr01
Mt01    Beef      10.0   ||  data_fr02
Vg01    Carrot    1.25   ||  data_mt01
Fr02    Orange    4.0    ||  data_mt02
Mt02    Pork      8.33   ||  data_vg01
As you can see, all the information follows correctly except the Range Name, so when I try to import I get a lot of data mismatches.
My other attempt was to make the range name an actual cell in Excel. With this method the cell follows the data, but it can be changed, so I tried to make it a protected cell. Sadly, rows then can't be inserted or deleted. I found a workaround that consists of keeping the names in a hidden sheet, but then I need to synchronize the sheets, which is not reliable for the same reasons mentioned previously.
Even worse, I must support both xls (97-2003) and xlsx.
So I'm looking for a stable workaround that will let me store my "range name" data in the cell somehow, invisible to the Excel user, but following the data so I can retrieve it at the right place when re-importing.
Thanks in advance.
EDIT :
In the end, I must be able to write this property from a C# application and then read the same property back with C#. It must be compatible with both Excel file formats, neither viewable nor editable by the Excel user, and keep its original value whatever happens to the data within the sheet, except deletion (I don't mind if I just put it on the cell I wrote Apple in and not the entire range).
OK (I still think it's better to add validation intelligence to the worksheet when you export, but YMMV).
Try using the Range.ID string property. It's not editable or visible from the Excel UI, and it moves around with the cell. If the cell gets deleted it disappears. If a cell gets copied the ID property gets copied too, so there would be a duplicate.
It was introduced in Excel 2000, so it probably won't work for Excel 97, but it should be OK in all file formats for Excel 2000 to Excel 2013.
Here is some example VBA code:
Sub putids()
    Dim j As Long
    For j = 1 To 5
        Range("a1").Offset(j - 1, 0).ID = CStr(j)
    Next j
End Sub
Sub getids()
    Dim j As Long
    For j = 1 To 5
        Debug.Print Range("a1").Offset(j - 1, 0).ID
    Next j
End Sub
I think you should use a key column, be it a unique name you've made up or a concatenation of the fields making up your data row. Whatever. Make that the leftmost column, then hide it and lock it so users can't show that column or change its contents.
Then in another worksheet, take those same values and paste them in, starting in A2.
Now in B2 enter this formula
=VLOOKUP(<this row's key value>,<Your data array in sheet1>,<column number>,FALSE)
Here is an example of how to set the fixed column/row references
=VLOOKUP($A2,BigNamedRange,B$1,FALSE)
Now hide that sheet.
What you now have in the first sheet is an area where your users can filter, sort and do whatever they like, and in your second, controlled sheet you have the data in the order you want to see it (which can be changed independently of the user's sheet).
Edit:
Click on 1: Allow Users to Edit Ranges and set the range you want to let users edit.
Then, 2: click Protect Sheet/Protect Workbook (whichever you need) to lock everything else.
Now your users can edit what you let them and nothing else.
I don't see how named ranges help you. Have you thought of adding validation code to the workbook using the before-save event, so that the user cannot save data that is not valid? Or of seeing how much you can do using Excel's data validation rules? Otherwise you have to read all the data and validate it later at DB update time (which is basically too late). Presumably the basic validation is that the ItemID is valid; your DB code won't care what order the data is in, and can skip empty rows etc.
We ended up using a little of everyone's suggestions and merging them.
None of the simple, normal solutions is viable in our context. The only property we could have stored something in (ID) isn't persistent. We also need the client not to destroy the value accidentally, given that anything can and will happen to the data, and we can't lock part of the sheet without disabling row manipulation, a side effect of having a locked cell. The closest we were able to get was to insert our keys as a formatted string in column A, wrapped in an odd-looking formula that hides it from display, and then hide the column so the user can't reach it accidentally.
=IF(FALSE,"our formatted string","")
Since this hidden column has data, it follows its row when sorted. Copying an entire row also won't be possible, because the user can only select from column B onward (which amounts to trying to insert 256 values into 255 cells), so we can limit the "false duplicates" somewhat, even if not eliminate them entirely.
On the importer side, we read the key back with a little trick: we compare the formula with the value (since the value is empty, only the formula holds our formatted data), use a small regex to extract the meaning of our formatted string, and then run all our validations before the actual database import.
For the rest, it comes down to training users not to "delete" the data in column A, and not to go looking for it.
Thanks again to everyone.
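The read-back step described above could look roughly like this. It is a sketch in Python rather than the poster's actual C# importer, and it assumes the formula text arrives exactly in the form shown, with a comma as the argument separator (some locales use a semicolon) and with embedded double quotes doubled as Excel requires. The function names are hypothetical.

```python
import re

# Matches =IF(FALSE,"<key>","") and captures <key>; "" inside the key is an
# escaped double quote in Excel formula syntax.
_HIDDEN_KEY = re.compile(r'^=IF\(FALSE,"((?:[^"]|"")*)",""\)$', re.IGNORECASE)


def make_hidden_key_formula(key):
    """Build the formula that carries the key but displays as an empty cell."""
    return '=IF(FALSE,"{}","")'.format(key.replace('"', '""'))


def extract_hidden_key(formula):
    """Return the key stored in the formula, or None if it does not match."""
    match = _HIDDEN_KEY.match(formula.strip())
    if match is None:
        return None
    return match.group(1).replace('""', '"')


formula = make_hidden_key_formula("data_fr01")
key = extract_hidden_key(formula)
```

A cell whose formula no longer matches the pattern (for example because the user typed over it) yields None, which the importer can report as a row with a missing or damaged key before touching the database.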

Interop - Setting a range for an Excel chart to an entire row

How do I set the source data of an excel interop chart to several entire rows?
I have a .csv file that is created by my program to display some results that are produced. For the sake of simplicity let's say these results and chart are displayed like this: (which is exactly how I want it to be)
Now the problem I am having is that the number of people is variable. So I really need to access the entire rows data.
Right now, I am doing this:
var range = worksheet.get_Range("A1", "D3");
xlExcel.ActiveChart.SetSourceData(range);
and this works great if you only have three Persons, but I need to access the entire row of data.
So to restate my question, how can I set the source data of my chart to several entire rows?
I tried looking here but couldn't seem to make that work with rows instead of columns.
var range = worksheet.get_Range("A1").CurrentRegion;
xlExcel.ActiveChart.SetSourceData(range);
EDIT: I am assuming that the cells in the data region won't be blank.
To test this,
1) place cursor on cell A1
2) press F5
3) click on "Special"
4) choose "Current Region" as option
5) click "OK"
This will select the cells surrounding A1 which are filled, which I believe is what you are looking for.
The equivalent of that in VBA code is the CurrentRegion property, so I think that should work.
Check out Range.EntireRow. I'm not 100% sure how to expand that to a single range containing 3 entire rows, but it shouldn't be that difficult to accomplish.
Another thing you can do is scan to get the actual maximum column index you need (this is assuming that there are guaranteed to be no gaps in the names), then use that index as you declare your range.
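Added code:
int c = 2; // start at column B
while (true)
{
    Range cell = (Range)worksheet.Cells[1, c];
    // Value2 is null for an empty cell and may be a number, so convert it before testing.
    if (String.IsNullOrEmpty(Convert.ToString((object)cell.Value2)))
    {
        c--; // step back to the last populated column
        break;
    }
    c++;
}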
Take a column from A to D that you're sure has no empty cells.
Loop to find the first empty cell in that column; it will be one past the last filled row.
Range Cell = Sheet.Range["A1"]; // or another column you're sure has no empty cells
int LineOffset = 0;
// Value2 is null for an empty cell, so convert it to a string before testing.
while (!String.IsNullOrEmpty(Convert.ToString((object)Cell.Offset[LineOffset, 0].Value2)))
{
    LineOffset++;
}
int LastLine = LineOffset; // offsets start at 0 from row 1, so this is the 1-based number of the last filled row
Then you can get Sheet.Range[Sheet.Cells[1, 1], Sheet.Cells[LastLine, 4]]
Out of the box here, but why not transpose the data? Three columns for Name, Height, Weight. Convert this from an ordinary range to a Table.
When any formula, including a chart's SERIES formula references a column of a table, it always references that column, no matter how long the table gets. Add another person (another row) and the chart displays the data with the added person. Remove a few people, and the chart adjusts without leaving blanks at the end.
This is illustrated in my tutorial, Easy Dynamic Charts Using Lists or Tables.

parsing address "label" fields in Excel, C#, VBA, other?

Someone's sent me a Word file full of address labels separated by tabs.
I'm trying to figure out the best way to import the addresses into individual records. Probably just go with NameLine, Address1, Address2 for each one (3 fields that I can parse later).
What can I do easily with C# or VBA? Or UltraEdit?
I like Excel for things like this. Just copy the text from Word, paste it into Excel, and use the text import wizard with a tab delimiter, making sure to treat consecutive delimiters as one.
Excel can even parse it for you:
Cut and paste the columns so that it's just one long column with all the addresses. (Let's say column A)
Assuming each address record is 3 lines long, we want to get that into a format with three columns: Name, Address1, Address2.
In Cell B1, create formula =A1.
In Cell C1, create formula =A2.
In Cell D1, create formula =A3.
Select cells B1 through D3, or D4 if you have blank lines between each address record.
Copy.
Go to cell B4, or B5 if there's blank lines between each address record.
CTRL+SHIFT+END to select everything until the end of the data (basically, cells B4:DXX, or B5:DXX with blank lines, should be selected)
Paste.
Create a new record at the top with your desired fields names.
Example result:
Afterwards, you can copy the results into a new worksheet (sans formulae, so it'll just be static text), format the data however you want it, and sort the data to remove those pesky blank lines.
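If you would rather do the reshaping in code, the same idea fits in a few lines. This is a sketch in Python (it translates directly to C# or VBA) and it assumes, as the steps above do, that every address has exactly three non-blank lines; blank separator lines are simply dropped. The function name is hypothetical, and the field names are the ones proposed in the question.

```python
def column_to_records(lines, fields=("NameLine", "Address1", "Address2")):
    """Turn one long column of address lines into one record per address.

    Assumption: every address occupies exactly len(fields) non-blank lines.
    """
    filled = [line.strip() for line in lines if line.strip()]
    size = len(fields)
    if len(filled) % size != 0:
        # A label with a missing or extra line would shift every later record.
        raise ValueError("line count is not a multiple of %d" % size)
    return [
        dict(zip(fields, filled[start:start + size]))
        for start in range(0, len(filled), size)
    ]


column = [
    "Ann Example", "1 First St", "Springfield", "",
    "Bob Sample", "2 Second Ave", "Shelbyville", "",
]
records = column_to_records(column)
```

The length check is deliberate: labels with four lines (for instance a company name) break the fixed-size assumption, and it is better to find out up front than to end up with silently misaligned records.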
If all the tabs line up in Word, you should be able to Alt-Select to select individual columns, then cut & paste them into one sequential column so you just get one contiguous file of Address1,Address2,Address3,BlankLine, which should then be trivial to parse.
