iterate over spark dataframe column in C#

iterate over spark dataframe column in C# - c#

I am using the microsoft.spark version 1.0.0 to process a parquet file that is received from the network. I have mapped the parquet file into a Dataframe and i am trying to strip the alias from the userid column which is stored as an email address (user testuser is saved as testuser#gmail.com). Can't seem to figure out how to turn Column into an enumerable and there is not built in way to iterate over the column. Any pointers would be helpful
dataframe.WithColumn("alias", ExtractAlias(dataframe["userid"]))
ExtractAlias(Column userID_column){need to iterate over the column to extract out the user alias}

using Microsoft.Spark.Sql;
dataframe.WithColumn("alias", Functions.split(Functions.col("userid"), "#").GetItem(0))

Related

Store filename and objectid tuple in collection or query direct using the filename

Currently I'm struggeling with querying data from a MongoDB GridFS by using the filename.
Current situation
I'm uploading data to the MongoDB by using a GridFSBucket and UploadFromBytesAsync which returns an ObjectId for referencing the data and takes a filename as parameter.
The Problem
The oposit of UploadFromBytesAsync is DownloadAsBytesAsync which should be used for querying the data from GridFS. But this method only accepts an ObjectId and not the unique filename which is also passed when using UploadFromBytesAsync. But I want to query the data by using the filename.
Possible solution
My Idea was, to create a collection, which stores the GridFS ObjectId and GridFS filename, to map them. So when querying data, i will search for the filename in the collection and then using the ObjectId to get the bytes from GridFS. The filename property would also get an index.
Or should I directly query from GridFS using the unique filename? And is the filename indexed than?
EDIT!
Oh, i think the problem is solved by using DownloadAsBytesByName.
Thank you very much!

As long as file names are unique - you can use get by file name.
To ensure there is only one file, you can use storage option: OverWrite
You can create index on fs collection to include fileName field.
-- edit
the call with overwrite loos like:
_db.GridFS. Upload(OverWrite(fileName, binaryStream))

Selecting the rows of table of ms word file through c#

I have a table which is in ms word .i need to fill the rows with some data through c#.
I was thinking to fill through array or list or some other data sources.
But the challenge is how to select the first row of the table, there are n number of tables is in my word file.

What type of libary are you using ? Try DocX https://docx.codeplex.com/.
It has great table integration. You can use a template file or a file as a template and create a new table. That way you dont need to worry about table indexin. Or you can use text replacement and fill the tables in the template file whit placeholders.

How to get .docx files based on key words

I have a sql table with the columns "candidatename", "candidatelocation" and "resume".here resume column have only .docx type files in binary form. from front end I need to enter some words or phrases. My requirement is to get all the records which contain these words(or phrases) in .docx file("resume" column).. Here I'm not getting that how to search the given words with the binary type column.. I need this using asp.net with c# and sql server

You will need to install the Microsoft Filter Pack (http://support.microsoft.com/en-us/kb/945934) which will enable you to create a full text index on the varbinary column you are using to store the .docx document.

SSIS Task for inconsistent column count import?

Problem.
I regularly receive a feed files from different suppliers. Although the column names are consistent the problem comes when some suppliers send text files with more or less columns in there feed file.
Furthermore the arrangement of these files are inconsistent.
Other than the Dynamic data flow task provided by Cozy Roc is there another way I could import these files. I am not a C# guru but i am driven torwards using a "Script Task" control flow or "Script Component" Data flow task.
Any suggestion, samples or direction will greatly be appreciated.
http://www.cozyroc.com/ssis/data-flow-task
Some forums
http://www.sqlservercentral.com/Forums/Topic525799-148-1.aspx#bm526400
http://www.bidn.com/forums/microsoft-business-intelligence/integration-services/26/dynamic-data-flow

Off the top of my head, I have a 50% solution for you.
The problem
SSIS really cares about meta data so variations in it tend to result in exceptions. DTS was far more forgiving in this sense. That strong need for consistent meta data makes use of the Flat File Source troublesome.
Query based solution
If the problem is the component, let's not use it. What I like about this approach is that conceptually, it's the same as querying a table-the order of columns does not matter nor does the presence of extra columns matter.
Variables
I created 3 variables, all of type string: CurrentFileName, InputFolder and Query.
InputFolder is hard wired to the source folder. In my example, it's C:\ssisdata\Kipreal
CurrentFileName is the name of a file. During design time, it was input5columns.csv but that will change at run time.
Query is an expression "SELECT col1, col2, col3, col4, col5 FROM " + #[User::CurrentFilename]
Connection manager
Set up a connection to the input file using the JET OLEDB driver. After creating it as described in the linked article, I renamed it to FileOLEDB and set an expression on the ConnectionManager of "Data Source=" + #[User::InputFolder] + ";Provider=Microsoft.Jet.OLEDB.4.0;Extended Properties=\"text;HDR=Yes;FMT=CSVDelimited;\";"
Control Flow
My Control Flow looks like a Data flow task nested in a Foreach file enumerator
Foreach File Enumerator
My Foreach File enumerator is configured to operate on files. I put an expression on the Directory for #[User::InputFolder] Notice that at this point, if the value of that folder needs to change, it'll correctly be updated in both the Connection Manager and the file enumerator. In "Retrieve file name", instead of the default "Fully Qualified", choose "Name and Extension"
In the Variable Mappings tab, assign the value to our #[User::CurrentFileName] variable
At this point, each iteration of the loop will change the value of the #[User::Query to reflect the current file name.
Data Flow
This is actually the easiest piece. Use an OLE DB source and wire it as indicated.
Use the FileOLEDB connection manager and change the Data Access mode to "SQL Command from variable." Use the #[User::Query] variable in there, click OK and you're ready to work.
Sample data
I created two sample files input5columns.csv and input7columns.csv All of the columns of 5 are in 7 but 7 has them in a different order (col2 is ordinal position 2 and 6). I negated all the values in 7 to make it readily apparent which file is being operated on.
col1,col3,col2,col5,col4
1,3,2,5,4
1111,3333,2222,5555,4444
11,33,22,55,44
111,333,222,555,444
and
col1,col3,col7,col5,col4,col6,col2
-1111,-3333,-7777,-5555,-4444,-6666,-2222
-111,-333,-777,-555,-444,-666,-222
-1,-3,-7,-5,-4,-6,-2
-11,-33,-77,-55,-44,-666,-222
Running the package results in these two screen shots
What's missing
I don't know of a way to tell the query based approach that it's OK if a column doesn't exist. If there's a unique key, I suppose you could define your query to have only the columns that must be there and then perform lookups against the file to try and obtain the columns that ought to be there and not fail the lookup if the column doesn't exist. Pretty kludgey though.

Our solution. We use parent child packages. In the parent pacakge we take the individual client files and transform them to our standard format files then call the child package to process the standard import using the file we created. This only works if the client is consistent in what they send though, if they try to change their format from what they agreed to send us, we return the file.

Creating an ETL system (Data import and transformation)

I have been tasked to write a module for importing data into a client's system.
I thought to break the process into 4 parts:
1. Connect to the data source (SQL, Excel, Access, CSV, ActiveDirectory, Sharepoint and Oracle) - DONE
2. Get the available tables/data groups from the source - DONE
i. Get the available fields form the selected table/data group - DONE
ii. Get all data from the selected fields - DONE
3. Transform data to the user's requirements
4. Write the transformed data the the MSSQL target
I am trying to plan how to handle complex data transformations like:
Get column A from Table tblA, inner joined to column FA from table tblB, and concatenate these two with a semicolon in between.
OR
Get column C from table tblC on source where column tblC.D is not in table tblG column G on target database.
My worry is not the visual, but the representation in code of this operation.
I am NOT asking for sample code, but rather for some creative ideas.
The data transformation will not be with free text, but drag and drop objects that represent actions.
I am a bit lost, and need some fresh input.

maybe you can grab some ideas from this open source project: Rhino ETL.

See my answer: Manipulate values in a datatable?

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.