How to create an array of dataframes by filtering a dataframe with a Seq of lists in Scala

I have one dataframe, Df1, with all the data; one of its columns holds values vl1, vl3, vl4, vl2, vl3, vl4, etc. I have another dataframe, valueDf, with a single column that holds the unique values vl1, vl2, vl3, etc.
I want to split valueDf into an array of 4 dataframes like
Df(0)=(vl1,vl4,vl5)
Df(1)=(vl3,vl2,vl6)
Df(2)=(vl7,vl8,vl9)... etc
So that I can do a leftsemi join on Df1 and write to 4 different locations:
/folder_Df(0)/
/folder_Df(1)/... etc.
I tried doing a randomSplit as below
val Dfs = valueDf.randomSplit(Array.range(1,5).map(_.toDouble), 1)
But now I have a set of values that need to go into Df(0), Df(1), Df(2) based on these lists: (vl1,vl4,vl5), (vl3,vl2,vl6), (vl7,vl8,vl9).
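A minimal sketch of one deterministic way to do this (written in PySpark here, though the question is Scala; the same DataFrame calls exist in both APIs). The column name value_col and the output paths are assumptions: instead of randomSplit, filter valueDf once per list with isin, then leftsemi-join each piece against Df1 and write it out.
from pyspark.sql import functions as F

# The lists that decide which values belong to which output dataframe.
groups = [["vl1", "vl4", "vl5"], ["vl3", "vl2", "vl6"], ["vl7", "vl8", "vl9"]]
# One dataframe per list: deterministic filtering instead of a random split.
Dfs = [valueDf.filter(F.col("value_col").isin(g)) for g in groups]
# leftsemi keeps only the Df1 rows whose value appears in the group's dataframe.
for i, vdf in enumerate(Dfs):
    (Df1.join(vdf, Df1["value_col"] == vdf["value_col"], "leftsemi")
        .write.mode("overwrite")
        .parquet(f"/folder_Df({i})/"))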

Related

Compile a count of similar rows in a Pandas Dataframe based on multiple column values

I have two Dataframes: one containing my data read in from a CSV file, and another that has the data grouped by all of the columns but the last, reindexed to contain a column with the size of each group.
df_k1 = pd.read_csv(filename, sep=';')
columns_for_groups = list(df_k1.columns)[:-1]
k1_grouped = df_k1.groupby(columns_for_groups).size().reset_index(name="Count")
I need to create a series such that every row(i) in the series corresponds to row(i) in my original Dataframe but the contents of the series need to be the size of the group that the row belongs to in the grouped Dataframe. I currently have this, and it works for my purposes, but I was wondering if anyone knew of a faster or more elegant solution.
size_by_row = []
for row in df_k1.itertuples():
    for group in k1_grouped.itertuples():
        if row[1:-1] == group[1:-1]:
            size_by_row.append(group[-1])
            break
group_size = pd.Series(size_by_row)
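A hedged, vectorized alternative: pandas can compute the per-row group size in one pass with groupby(...).transform('size'), avoiding the nested itertuples loops entirely while staying aligned with df_k1's index:
# Same result as the loop above: one group-size entry per original row.
group_size = df_k1.groupby(columns_for_groups)[df_k1.columns[-1]].transform('size')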

Pyspark dynamic column selection from dataframe

I have a dataframe with multiple columns such as t_orno, t_pono, t_sqnb, t_pric, and so on (it's a table with many columns).
The 2nd dataframe contains the names of certain columns from the 1st dataframe, e.g.
columnname
t_pono
t_pric
:
:
I need to select only those columns from the 1st dataframe whose name is present in the 2nd. In above example t_pono,t_pric.
How can this be done?
Let's say you have the following columns (which can be obtained using df.columns, which returns a list):
df1_cols = ["t_orno", "t_pono", "t_sqnb", "t_pric"]
df2_cols = ["columnname", "t_pono", "t_pric"]
To get only those columns from the first dataframe that are present in the second one, you can take the set intersection (converted to a list so it can be used to select data):
list(set(df1_cols).intersection(df2_cols))
And we get the result:
["t_pono", "t_pric"]
To put it all together and select only those columns:
select_columns = list(set(df1_cols).intersection(df2_cols))
new_df = df1.select(*select_columns)
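If the column names live in a dataframe rather than a plain list (as in the question), one way to pull them out first, assuming the 2nd dataframe is called df2 and its single column is columnname:
# Collect the names stored as rows in df2 into a plain Python list.
df2_cols = [row["columnname"] for row in df2.select("columnname").collect()]
select_columns = list(set(df1.columns).intersection(df2_cols))
new_df = df1.select(*select_columns)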

Joining all elements in an array in a dataframe column with another dataframe

Let's say pcPartsInfoDf has the columns
pcPartCode:integer
pcPartName:string
And df has the array column
pcPartCodeList:array
|-- element:integer
The pcPartCodeList in df has a list of codes for each row that match with pcPartCode values in pcPartsInfoDf, but only pcPartsInfoDf has the names of the parts.
I'm trying to join the two dataframes so that we get a new column that is an array of strings for all the pc part names for a row, corresponding to the array of ints, pcPartCodeList. I tried doing this with the code below, but this only adds at most 1 part since pcPartName is typed as a string and only holds 1 value.
df
  .join(pcPartsInfoDf, expr("array_contains(pcPartCodeList, pcPartCode)"))
  .select(df("*"), pcPartsInfoDf("pcPartName"))
How could I collect all the pcPartName values corresponding to a pcPartCodeList for a row, and put them in an array of strings in that row?
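One hedged way to do this (sketched in PySpark; the question is Scala Spark, but explode, collect_list and monotonically_increasing_id exist in both APIs): key each row, explode the code array, look up each code's name, then gather the names back per row. row_id is an assumed helper column, and collect_list does not guarantee the names keep the order of the codes.
from pyspark.sql import functions as F

# Tag each df row so the exploded rows can be grouped back together.
with_id = df.withColumn("row_id", F.monotonically_increasing_id())
# One (row_id, pcPartCode) row per array element, then a name lookup.
names = (with_id
    .select("row_id", F.explode("pcPartCodeList").alias("pcPartCode"))
    .join(pcPartsInfoDf, "pcPartCode", "left")
    .groupBy("row_id")
    .agg(F.collect_list("pcPartName").alias("pcPartNames")))
# Attach the array of names back onto the original rows.
result = with_id.join(names, "row_id", "left").drop("row_id")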

Convert Series to Dataframe where series index is Dataframe column names

I am selecting row by row as follows:
for i in range(num_rows):
    row = df.iloc[i]
As a result I get a Series object where row.index.values contains the names of the df columns.
But instead I wanted a dataframe with only one row, keeping the dataframe columns in place.
When I do row.to_frame(), instead of a 1x85 dataframe (1 row, 85 cols) I get an 85x1 dataframe where the index contains the names of the columns, and row.columns outputs
Int64Index([0], dtype='int64').
But all I want is just the original dataframe columns with only one row. How do I do it?
Or how do I convert the row.index values into row.columns values and change the 85x1 shape to 1x85?
You just need to add .T (transpose):
row.to_frame().T
Alternatively, change your for loop by adding [], so iloc returns a one-row DataFrame directly:
for i in range(num_rows):
    row = df.iloc[[i]]
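For concreteness, a tiny sketch demonstrating both fixes on a made-up 2x2 frame (85 columns behave the same way):
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
row = df.iloc[0].to_frame().T   # Series -> 1x2 DataFrame via transpose
print(row.columns.tolist())     # ['a', 'b']
row = df.iloc[[0]]              # list indexer yields a 1x2 DataFrame directly
print(row.shape)                # (1, 2)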

Replace a subset of pandas data frame with another data frame

I have a data frame (DF1) with 100 columns (one of the columns is ID).
I have one more data frame (DF2) with 30 columns (one column is ID).
I have to update the first 30 columns of DF1 with the values in DF2, keeping the values in the remaining columns of DF1 intact.
That is, update the first 30 column values in DF1 (out of the 100 columns) when the ID in DF2 is present in DF1.
I tested this on Python 3.7 but I see no reason for it not to work on 2.7:
joined = df1.reset_index() \
    [['index', 'ID']] \
    .merge(df2, on='ID')
df1.loc[joined['index'], df1.columns[:30]] = joined.drop(columns=['index', 'ID'])
This assumes that df2 doesn't have a column called index, or the merge will fail, complaining about the duplicate key and suffixes.
Here is a slow-motion walkthrough of its inner workings:
1. df1.reset_index() returns a dataframe the same as df1 but with an additional column: index.
2. [['index', 'ID']] extracts a dataframe containing just these 2 columns from the dataframe in #1.
3. .merge(...) merges with df2, matching on ID. The result (joined) is a dataframe with 31 columns: index, ID and the remaining 29 columns of df2.
4. df1.loc[<row_indexes>, <column_names>] = <another_dataframe> means you want to replace those particular cells with data from another_dataframe. Since joined has 31 columns, we need to drop the extra 2 (index and ID).
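For concreteness, a hedged toy run of the pattern, with 3 columns standing in for 100 and 2 for 30 (all names made up). This variant assigns .values and targets df2's own data columns, which sidesteps pandas' label alignment during the .loc assignment:
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "a": [10, 20, 30], "x": [0, 0, 0]})
df2 = pd.DataFrame({"ID": [2, 3], "a": [99, 98]})
joined = df1.reset_index()[['index', 'ID']].merge(df2, on='ID')
# .values assigns positionally, so the RHS's RangeIndex cannot misalign rows.
df1.loc[joined['index'], df2.columns.drop('ID')] = \
    joined.drop(columns=['index', 'ID']).values
print(df1)  # rows with ID 2 and 3 now have a=99 and a=98; x is untouched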