PySpark Aggregation on Comma Separated Column - dataframe

I have a huge DataFrame; two of its many columns are "NAME" and "VALUE". One of the row values in the "NAME" column is "X,Y,V,A".
I want to transpose my DataFrame so the "NAME" values become columns and the averages of "VALUE" become the row values.
I used the pivot function:
df1 = df.groupby('DEVICE', 'DATE').pivot('NAME').avg('VALUE')
All NAME values except "X,Y,V,A" work well with the above. I am not sure how to separate the 4 values of "X,Y,V,A" and aggregate on the individual values.

IIUC, you need to split and explode the string first:
from pyspark.sql.functions import split, explode
df = df.withColumn("NAME", explode(split("NAME", ",")))
Now you can group and pivot:
df1 = df.groupby('DEVICE', 'DATE').pivot('NAME').avg('VALUE')
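For illustration, here is a minimal end-to-end sketch with made-up sample data (the DEVICE/DATE values and the SparkSession setup are assumptions, not from the question):
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data shaped like the question's DataFrame.
df = spark.createDataFrame(
    [
        ("dev1", "2023-01-01", "X,Y,V,A", 10.0),
        ("dev1", "2023-01-01", "X", 20.0),
        ("dev1", "2023-01-01", "Z", 5.0),
    ],
    ["DEVICE", "DATE", "NAME", "VALUE"],
)

# Split the comma-separated NAME into an array and explode it into one row
# per element; the row's VALUE is repeated for each element.
df = df.withColumn("NAME", explode(split("NAME", ",")))

# Pivot on the now-atomic NAME values and average VALUE per DEVICE and DATE.
df1 = df.groupby("DEVICE", "DATE").pivot("NAME").avg("VALUE")
df1.show()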

Related

Pandas: how to group by day and another column

I am getting the daily counts of rows from a dataframe using
df = df.groupby(by=df['startDate'].dt.date).count()
How can I modify this so I can also group by another column 'unitName'?
Thank you
Pass a list to groupby and use GroupBy.size:
df = df.groupby([df['startDate'].dt.date, 'unitName']).size()
If you need to count non-missing values, e.g. in column col, use DataFrameGroupBy.count:
df = df.groupby([df['startDate'].dt.date, 'unitName'])['col'].count()
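As a quick sketch with made-up data (the column names follow the question):
import pandas as pd

df = pd.DataFrame({
    "startDate": pd.to_datetime(["2023-01-01 08:00", "2023-01-01 09:00", "2023-01-02 10:00"]),
    "unitName": ["A", "A", "B"],
    "col": [1.0, None, 3.0],
})

# Number of rows per (day, unitName), counting rows with missing values too.
sizes = df.groupby([df["startDate"].dt.date, "unitName"]).size()

# Number of non-missing values in 'col' per (day, unitName).
counts = df.groupby([df["startDate"].dt.date, "unitName"])["col"].count()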

DataFrame Groupby apply on second dataframe?

I have 2 dataframes df1, df2. Both have id as a column. I want to compute a new column, weighted_average, in df1 that is a function of the values in df2 with the same id.
First, I think I should do df1.groupby("id"). Is it possible to use GroupBy.apply(...) and have it use values from df2? In the examples I've seen, it usually just operates on df1 values.
If they have the same id positions and length, you can do something like:
df2["new column name"] = df1["column name"].apply(...)

Pyspark dynamic column selection from dataframe

I have a dataframe with multiple columns, such as t_orno, t_pono, t_sqnb, t_pric, and so on (it's a table with many columns).
The 2nd dataframe contains certain column names from the 1st dataframe, e.g.:
columnname
t_pono
t_pric
:
:
I need to select only those columns from the 1st dataframe whose names are present in the 2nd. In the above example, t_pono and t_pric.
How can this be done?
Let's say you have the following columns (which can be obtained using df.columns, which returns a list):
df1_cols = ["t_orno", "t_pono", "t_sqnb", "t_pric"]
df2_cols = ["columnname", "t_pono", "t_pric"]
To get only those columns from the first dataframe that are present in the second one, you can take the set intersection (cast to a list so it can be used to select data):
list(set(df1_cols).intersection(df2_cols))
And we get the result:
["t_pono", "t_pric"]
To put it all together and select only those columns:
select_columns = list(set(df1_cols).intersection(df2_cols))
new_df = df1.select(*select_columns)
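As a runnable end-to-end sketch (the sample rows are invented, and it assumes the wanted names sit in df2's columnname column, as shown in the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [(1, "P1", 1, 9.99)], ["t_orno", "t_pono", "t_sqnb", "t_pric"]
)
df2 = spark.createDataFrame([("t_pono",), ("t_pric",)], ["columnname"])

# Pull the wanted names out of df2's rows, then intersect with df1's columns.
wanted = [row["columnname"] for row in df2.select("columnname").collect()]
select_columns = list(set(df1.columns).intersection(wanted))

new_df = df1.select(*select_columns)
new_df.show()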

Create columns which correspond to the number of characters contained in strings of a dataframe column

I have a dataframe whose first column contains strings (e.g. 'AABCD'). I have to count the occurrences of each character in each string, and the results must be stored in separate columns (one column per character: A, B, C, D).
In other words, I want to create columns A, B, C, D containing, for each line, the number of occurrences of each character in that line's string.
Assuming the columns already exist in the dataframe, and the column containing the strings really is a column and not the index:
Set up the dataframe:
import pandas as pd

df = pd.DataFrame({
    "string": ["AABCD", "ACCB", "AB", "AC"],
    "A": [float("nan"), float("nan"), float("nan"), float("nan")],
    "B": [float("nan"), float("nan"), float("nan"), float("nan")],
    "C": [float("nan"), float("nan"), float("nan"), float("nan")],
    "D": [float("nan"), float("nan"), float("nan"), float("nan")],
})
Loop through the columns and apply a lambda function to each row:
for col_name in df.columns:
    if col_name == "string":
        continue
    df[col_name] = df.apply(lambda row: row["string"].count(col_name), axis=1)
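On the sample frame above, the loop should produce counts like those shown in the comment below. An equivalent form using Series.str.count, offered only as an alternative sketch (not part of the original answer), avoids the per-row lambda:
for col_name in ["A", "B", "C", "D"]:
    df[col_name] = df["string"].str.count(col_name)

print(df)
#   string  A  B  C  D
# 0  AABCD  2  1  1  1
# 1   ACCB  1  1  2  0
# 2     AB  1  1  0  0
# 3     AC  1  0  1  0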

Convert Series to Dataframe where series index is Dataframe column names

I am selecting row by row as follows:
for i in range(num_rows):
    row = df.iloc[i]
as a result I am getting a Series object where row.index.values contains the names of df's columns.
But instead I want a dataframe with only one row, keeping the original dataframe columns in place.
When I do row.to_frame(), instead of a 1x85 dataframe (1 row, 85 cols) I get an 85x1 dataframe whose index contains the column names and whose columns are Int64Index([0], dtype='int64').
All I want is the original dataframe's columns with only one row. How do I do it?
Or: how do I convert the row.index values into column names and change the 85x1 shape to 1x85?
You just need to add .T to transpose it:
row.to_frame().T
Alternatively, change your for loop by adding [] so iloc returns a one-row DataFrame directly:
for i in range(num_rows):
    row = df.iloc[[i]]
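Both options in one small sketch with a made-up two-column frame:
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Option 1: transpose the single-column frame produced by to_frame().
row = df.iloc[0]
print(row.to_frame().T)   # 1 row, columns 'a' and 'b'

# Option 2: index with a list so iloc returns a one-row DataFrame directly;
# this also preserves the original column dtypes for mixed-type frames.
print(df.iloc[[0]])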