Duplicate row in PySpark Dataframe based off value in another column - dataframe

I have a dataframe that looks like the following:
ID   NumRecords
123  2
456  1
789  3
I want to create a new DataFrame that concatenates the two columns and duplicates the rows based on the value in NumRecords.
So the output should be:
ID_New 123-1
ID_New 123-2
ID_New 456-1
ID_New 789-1
ID_New 789-2
ID_New 789-3
I was looking into the "explode" function, but based on the example I saw it seemed to only take a constant.

I had a similar issue; this code will duplicate the rows based on the value in the NumRecords column:
from pyspark.sql import Row
def duplicate_function(row):
    data = []  # list of rows to return
    to_duplicate = float(row["NumRecords"])
    i = 0
    while i < to_duplicate:
        row_dict = row.asDict()  # convert a Spark Row object to a Python dictionary
        row_dict["SERIAL_NO"] = str(i)
        new_row = Row(**row_dict)  # create a Spark Row object from the dictionary
        data.append(new_row)  # add this Row to the list
        i += 1
    return data  # return the final list

# create the final dataset based on the value in the NumRecords column
# (let Spark infer the schema, since each row now also carries SERIAL_NO)
df_flatmap = df_input.rdd.flatMap(duplicate_function).toDF()
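If you then want the ID_New format from the question, a small follow-up sketch (assuming the SERIAL_NO column produced above, which starts at 0, hence the + 1 for 1-based suffixes):
from pyspark.sql import functions as F

df_new = df_flatmap.withColumn(
    "ID_New",
    F.concat_ws("-", F.col("ID").cast("string"),
                (F.col("SERIAL_NO").cast("int") + 1).cast("string")),
)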

You can use a udf:
from pyspark.sql.functions import udf, explode, concat_ws
from pyspark.sql.types import ArrayType, StringType

range_ = udf(lambda x: [str(y) for y in range(1, x + 1)], ArrayType(StringType()))
df = df.withColumn("records", range_("NumRecords")) \
    .withColumn("record", explode("records")) \
    .withColumn("ID_New", concat_ws("-", "ID", "record"))

Related

Execute same function on different columns to make rows to append another table

How could I perform the same operation for 15 columns on a DataFrame?
How could I parallelize the operation?
I have input data that I need to use to update a reference table. There are more columns, but I think these 3 help to understand what I am trying to do.
Table: input
rowid  col1        col2        col3
id1    col1_data1  col2_data1  col3_data1
id2    col1_data2  col2_data2  col3_data2
The reference table contains the value of each cell, its md5, and the name of the column it came from:
Table: references
col_data    md5             ref_name
col1_data1  md5_col1_data1  col1_name
col1_data2  md5_col1_data2  col1_name
col1_data3  md5_col1_data3  col1_name
col2_data1  md5_col2_data1  col2_name
col2_data2  md5_col2_data2  col2_name
col2_data3  md5_col2_data3  col2_name
col3_data1  md5_col3_data1  col3_name
col3_data2  md5_col3_data2  col3_name
col3_data3  md5_col3_data3  col3_name
I created a function similar to the one below that checks the input table against the reference table; when new data is found, the reference is created and a partial dataframe is returned, so that at the end the references table can be updated:
def repeatedly_excuted_funcion(input_data, references, col_name):
    """
    input_data is the full dataframe
    references is the table to check if it has the value and, if not, create it
    col_name is the name of the column that will be considered on the execution
    """
    # ... some code ...
    return partial_df

df_col1 = repeatedly_excuted_funcion(input_data, references, "col1")
df_col2 = repeatedly_excuted_funcion(input_data, references, "col2")
data_to_append = df_col1.union(df_col2)
df_col3 = repeatedly_excuted_funcion(input_data, references, "col3")
data_to_append = data_to_append.union(df_col3)
I only included a 3-column example, but there are 15 columns to check.
At the end, the idea is to update the references table with the newly calculated md5 values.
(
    data_to_append.write.format("delta")
    .mode("append")
    .saveAsTable(database_table)
)
No function, no unions, and only 1 shuffle (the anti join):
1. Create all 3 final columns (data, md5, col_name) inside an array in the Input table.
2. Unpivot - from every row of 15 columns make 15 rows of 1 column.
3. Split the 1 array column into 3 data columns.
4. Filter out rows which already exist in References.
5. Append the result.
from pyspark.sql import functions as F
cols = [f'col{i}' for i in range(1, 16)]  # col1 ... col15
# Change Input columns to arrays
df_input = df_input.select(
    *[F.array(F.col(c), F.md5(c), F.lit(c)).alias(c) for c in cols]
)
# Unpivot Input table
stack_string = ", ".join([f"`{c}`" for c in cols])
df_input2 = df_input.select(
    F.expr(f"stack({len(cols)}, {stack_string}) as col_data"))
# Make 3 columns from 1 array column
df_input3 = df_input2.select(
    F.element_at('col_data', 1).alias('col_data'),
    F.element_at('col_data', 2).alias('md5'),
    F.element_at('col_data', 3).alias('ref_name'),
)
# Keep only rows which don't exist in References table
data_to_append = df_input3.join(df_references, 'col_data', 'anti')
(
    data_to_append.write.format("delta")
    .mode("append")
    .saveAsTable(database_table)
)
Create an empty DF with the correct schema, get all the columns, and union the per-column results onto it.
I'm not sure that for 15 items it's worth parallelizing, or that you wouldn't run into issues with the Spark context (it's not available inside an executor), meaning you would have to have pure Python code inside repeatedly_excuted_funcion. You might be able to do all rows at once with a UDF, but I'm not sure that would perform as well (UDFs are known for poor performance due to the lack of vectorization).
from pyspark.sql.types import StructType, StructField, StringType

unionSchema = StructType([
    StructField('column', StringType(), True)])
my_union = spark.createDataFrame(data=[], schema=unionSchema)
for i in myDF.columns:
    my_union = my_union.union(repeatedly_excuted_funcion(input_data, references, i))
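A slightly more compact variant of the same loop (a sketch, assuming the same repeatedly_excuted_funcion, input_data, references and myDF) folds the per-column results together with functools.reduce:
from functools import reduce

# one partial dataframe per column, then union them all
partials = [repeatedly_excuted_funcion(input_data, references, c) for c in myDF.columns]
my_union = reduce(lambda a, b: a.unionByName(b), partials)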
What about pivoting the data and performing one join?
The code below creates a map. Building the input is a little annoying, as I create in Python a list of [lit(column_name1), col(column_name1), lit(column_name2), ...]. The main purpose of this map is to explode it, so that the first table ends up in a format similar to the reference df and one normal join can be performed.
from itertools import chain
from pyspark.sql.functions import create_map, lit, col, explode

column_names = ["col1", "col2", "col3"]
df \
    .withColumn("features_map", create_map(
        list(chain(*[(lit(c), col(c)) for c in column_names]))
    )) \
    .select("rowid", explode("features_map").alias("ref_name", "col_data")) \
    .join(ref_df, on=["ref_name", "col_data"], how="left") ....

Getting same value from list in dataframe column using Python

I have a dataframe with 3 columns. Now I have added one more column, in which I am adding unique values using a random function.
I created a list variable and, using a for loop, I am adding random strings to that list.
After that, I created another loop in which I extract values from the list and add them to the column.
But the same value is added to every row each time.
df = pd.read_csv("test.csv")
lst = []
for i in range(20):
randColumn = ''.join(random.choice(string.ascii_uppercase + string.digits)
for i in range(20))
lst.append(randColumn)
for j in lst:
df['randColumn'] = j
print(df)
#Output.......
A B C randColumn
0 1 2 3 WHI11NJBNI8BOTMA9RKA
1 4 5 6 WHI11NJBNI8BOTMA9RKA
Could you please help me fix this? Why does each row have the same value from the list?
Updated to work correctly with any type of column in df.
If I understood your question correctly, you can use the zip method of rdd to achieve your goal.
import random
import string

from pyspark.sql import SparkSession, Row

lst = []
for i in range(2):
    rand_column = ''.join(random.choice(string.ascii_uppercase + string.digits) for i in range(20))
    # Adding random strings as Row to list
    lst.append(Row(random=rand_column))

# Making rdd from random strings array (sparkSession is an existing SparkSession)
random_rdd = sparkSession.sparkContext.parallelize(lst)
res = df.rdd.zip(random_rdd).map(lambda rows: Row(**(rows[0].asDict()), **(rows[1].asDict()))).toDF()
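For reference, the original pandas loop fails because df['randColumn'] = j assigns the scalar j to every row on each iteration, so only the last list element survives. If plain pandas is enough, a minimal sketch (assuming lst has at least len(df) entries) is to assign the list in one go:
df['randColumn'] = lst[:len(df)]  # one distinct random string per row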

how to put first value in one column and remaining into other column?

ROCO2_CLEF_00001.jpg,C3277934,C0002978
ROCO2_CLEF_00002.jpg,C3265939,C0002942,C2357569
I want to make a pandas DataFrame from a csv file.
I want to put the first entry of each row (the filename) into a column named "filenames", and the remaining entries into another column named "class". How can I do that?
In case your file doesn't have a fixed number of commas per row, you could do the following:
import pandas as pd
csv_path = 'test_csv.csv'
raw_data = open(csv_path).readlines()
# clean rows
raw_data = [x.strip().replace("'", "") for x in raw_data]
print(raw_data)
# make split between data
raw_data = [ [x.split(",")[0], ','.join(x.split(",")[1:])] for x in raw_data]
print(raw_data)
# build the pandas Dataframe
column_names = ["filenames", "class"]
temp_df = pd.DataFrame(data=raw_data, columns=column_names)
print(temp_df)
filenames class
0 ROCO2_CLEF_00001.jpg C3277934,C0002978
1 ROCO2_CLEF_00002.jpg C3265939,C0002942,C2357569
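A slightly more compact sketch of the same idea, splitting each line only on the first comma (same assumed test_csv.csv layout):
import pandas as pd

rows = [line.strip().replace("'", "").split(",", 1) for line in open('test_csv.csv')]
temp_df = pd.DataFrame(rows, columns=["filenames", "class"])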

Streamlit - Applying value_counts / groupby to column selected on run time

I am trying to apply the value_counts method to a DataFrame based on the columns selected dynamically in the Streamlit app.
This is what I am trying to do:
if st.checkbox("Select Columns To Show"):
    all_columns = df.columns.tolist()
    selected_columns = st.multiselect("Select", all_columns)
    new_df = df[selected_columns]
    st.dataframe(new_df)
The above lets me select columns and displays data for the selected columns. I am trying to see how I could apply the value_counts/groupby methods on this output in the Streamlit app.
If I try to do the below:
st.table(new_df.value_counts())
I get the below error
AttributeError: 'DataFrame' object has no attribute 'value_counts'
I believe the issue lies in passing a list of columns to a dataframe. When you pass a single column in [] to a dataframe, you get back a pandas.Series object (which has the value_counts method). But when you pass a list of columns, you get back a pandas.DataFrame, which doesn't have a value_counts method defined on it (DataFrame.value_counts was only added in pandas 1.1).
Can you try st.table(new_df[col_name].value_counts())
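If more than one column is selected, a groupby-based sketch (using the new_df and selected_columns from the question) gives a similar count table:
counts = new_df.groupby(selected_columns).size().reset_index(name="count")
st.table(counts)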
I think the error is because value_counts() is applicable to a Series and not a DataFrame.
You can try converting the ".value_counts" output to a dataframe.
If you want to apply it to one single column:
def value_counts_df(df, col):
    """
    Returns pd.value_counts() as a DataFrame

    Parameters
    ----------
    df : Pandas Dataframe
        Dataframe on which to run value_counts(), must have column `col`.
    col : str
        Name of column in `df` for which to generate counts

    Returns
    -------
    Pandas Dataframe
        Returned dataframe will have a single column named "count" which contains the
        value_counts() for each unique value of df[col]. The index name of this dataframe is `col`.

    Example
    -------
    >>> value_counts_df(pd.DataFrame({'a':[1, 1, 2, 2, 2]}), 'a')
       count
    a
    2      3
    1      2
    """
    df = pd.DataFrame(df[col].value_counts())
    df.index.name = col
    df.columns = ['count']
    return df
val_count_single = value_counts_df(new_df, selected_col)
If you want to apply it to all object columns in the dataframe:
def valueCountDF(df, object_cols):
    c = df[object_cols].apply(lambda x: x.value_counts(dropna=False)).T.stack().astype(int)
    p = (df[object_cols].apply(lambda x: x.value_counts(normalize=True,
                                                        dropna=False)).T.stack() * 100).round(2)
    cp = pd.concat([c, p], axis=1, keys=["Count", "Percentage %"])
    return cp
val_count_df_cols = valueCountDF(df, selected_columns)
And finally, you can use st.table or st.dataframe to show the dataframe in your Streamlit app.
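For example, with the frames built above:
st.dataframe(val_count_single)
st.dataframe(val_count_df_cols)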

How can I add values from pandas group to new Dataframe after a function?

I am trying to separate a Dataframe into groups, run each group through a function, and have the return value from the first row of each group placed into a new Dataframe.
When I try the code below, I can print out the information I want, but when I try to add it to the new Dataframe, it only shows the values for the last group.
How can I add the values from each group into the new Dataframe?
Thanks,
Here is what I have so far:
import pandas as pd
import numpy as np
#Build random dataframe
df = pd.DataFrame(np.random.randint(0, 40, size=10),
                  columns=["Random"],
                  index=pd.date_range("20200101", freq='6h', periods=10))
df["Random2"] = np.random.randint(70, 100, size=10)
df["Random3"] = 2
df.index = df.index.map(lambda t: t.strftime('%Y-%m-%d'))
df.index.name = 'Date'
df.reset_index(inplace=True)

#Setup groups by date
df = df.groupby(['Date']).apply(lambda x: x.reset_index())
df.drop(["index", "Date"], axis=1, inplace=True)

#Create new dataframe for newValue
df2 = pd.DataFrame(index=(df.index)).unstack()

#random function for an example
def any_func(df):
    df["Value"] = df["Random"] * df["Random2"] / df["Random3"]
    return df["Value"]

#loop by unique group name
for date in df.index.get_level_values('Date').unique():
    #I can print the data I want
    print(any_func(df.loc[date])[0])
    #But when I add it to a new dataframe, it only shows the value from the last group
    df2["newValue"] = any_func(df.loc[date])[0]

df2
Unrelated, but try modifying your any_func to take advantage of vectorized functions if possible.
Now if I understand you correctly:
new_value = df['Random'] * df['Random2'] / df['Random3']
df2['New Value'] = new_value.loc[:, 0]
This line of code gave me the desired outcome. I just needed to set the index using the "date" variable when I created the column, not when I created the Dataframe.
df2.loc[date, "newValue"] = any_func(df.loc[date])[0]