Getting same value from list in dataframe column using Python

I have a dataframe with 3 columns. I added one more column, which I want to fill with unique values generated with the random module.
I created a list variable and, in a for loop, appended a random string to that list on each iteration.
After that, I wrote another loop that takes each value from the list and assigns it to the new column.
But the same value ends up in every row, every time.
import random
import string
import pandas as pd

df = pd.read_csv("test.csv")
lst = []
for i in range(20):
    randColumn = ''.join(random.choice(string.ascii_uppercase + string.digits)
                         for i in range(20))
    lst.append(randColumn)
for j in lst:
    df['randColumn'] = j
print(df)
# Output:
A B C randColumn
0 1 2 3 WHI11NJBNI8BOTMA9RKA
1 4 5 6 WHI11NJBNI8BOTMA9RKA
Could you please help me fix this? Why does each row get the same value from the list?

Updated to work correctly with any type of column in df.
If I understood your question correctly, you can use the zip method of an RDD to achieve your goal.
import random
import string
from pyspark.sql import SparkSession, Row
import pyspark.sql.types as t

sparkSession = SparkSession.builder.getOrCreate()

lst = []
for i in range(2):
    rand_column = ''.join(random.choice(string.ascii_uppercase + string.digits) for i in range(20))
    # Adding random strings as Row to list
    lst.append(Row(random=rand_column))

# Making rdd from random strings array
random_rdd = sparkSession.sparkContext.parallelize(lst)
res = df.rdd.zip(random_rdd).map(lambda rows: Row(**rows[0].asDict(), **rows[1].asDict())).toDF()
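For the plain pandas frame in the question, the reason every row looks the same is that df['randColumn'] = j assigns a single scalar, which pandas broadcasts to the whole column, so the last loop iteration overwrites everything. A minimal sketch of the direct fix, assuming the list holds at least as many strings as the frame has rows:
# Assign the list itself instead of looping over scalars, so each row gets its own value.
# Assumes len(lst) >= len(df); trim to the frame's length to be safe.
df['randColumn'] = lst[:len(df)]
print(df)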

Related

Pandas: Creating empty dataframe in for loop, appending

I would like to create a ((25520*43),3) pandas Dataframe in a for loop.
I created the dataframe like:
lst=['Region', 'GeneID', 'DistanceValue']
df=pd.DataFrame(index=lst).T
Now I want to fill 'Region' 43 times with 25520 values each, and likewise 'GeneID' and 'DistanceValue'.
This is my for loop for that:
for i in range(43):
    df.DistanceValue = np.sort(distance[i,:])
    df.Region = np.ones(25520) * i
    args = np.argsort(distance[i,:])
    df.GeneID = ids[int(args[i])]
But then my df only has shape (25520, 3), so only the last of the 43 iterations is filled in.
How can I concat all iterations one to 43 in my df?
I can't reproduce your example, but there are a couple of corrections you can make:
lst = ['Region', 'GeneID', 'DistanceValue']
df = pd.DataFrame(index=lst).T

region = []
for i in range(43):
    region.append(np.ones(25520) * i)
flat_list = [item for sublist in region for item in sublist]
df.Region = flat_list
First create a new list outside the loop, then append values to it inside the loop.
The flat_list consolidates all 43 arrays into one, and then you can map it to the DataFrame. It is always easier to fill DataFrame values outside of a loop.
Similarly you can update all 3 columns, as in the sketch below.
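A minimal sketch of the same pattern applied to all three columns at once (assuming, as in the question, that distance is a (43, 25520) array and ids holds the gene identifiers):
import numpy as np
import pandas as pd

region, gene_id, distance_value = [], [], []
for i in range(43):
    order = np.argsort(distance[i, :])                # sort order for this region
    distance_value.append(np.sort(distance[i, :]))    # sorted distances
    region.append(np.ones(25520) * i)                 # region label repeated 25520 times
    gene_id.append(np.asarray(ids)[order])            # gene ids in the same sorted order

# one concatenation per column after the loop instead of overwriting inside it
df = pd.DataFrame({
    'Region': np.concatenate(region),
    'GeneID': np.concatenate(gene_id),
    'DistanceValue': np.concatenate(distance_value),
})
The resulting frame has shape (25520 * 43, 3), which matches the goal stated in the question.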

Streamlit - Applying value_counts / groupby to column selected on run time

I am trying to apply the value_counts method to a DataFrame based on the columns selected dynamically in the Streamlit app.
This is what I am trying to do:
if st.checkbox("Select Columns To Show"):
    all_columns = df.columns.tolist()
    selected_columns = st.multiselect("Select", all_columns)
    new_df = df[selected_columns]
    st.dataframe(new_df)
The above lets me select columns and displays data for the selected columns. I am trying to see how I could apply the value_counts/groupby method on this output in the Streamlit app.
If I try to do the below
st.table(new_df.value_counts())
I get the below error
AttributeError: 'DataFrame' object has no attribute 'value_counts'
I believe the issue lies in passing a list of columns to a dataframe. When you pass a single column in [] to a dataframe, you get back a pandas.Series object (which has the value_counts method). But when you pass a list of columns, you get back a pandas.DataFrame (which, in older pandas versions, doesn't have a value_counts method defined on it).
Can you try st.table(new_df[col_name].value_counts())
I think the error is because value_counts() is applicable to a Series and not a DataFrame.
You can try converting the .value_counts() output to a dataframe.
If you want to apply it to one single column:
def value_counts_df(df, col):
    """
    Returns pd.value_counts() as a DataFrame.

    Parameters
    ----------
    df : pandas DataFrame
        Dataframe on which to run value_counts(); must have column `col`.
    col : str
        Name of column in `df` for which to generate counts.

    Returns
    -------
    pandas DataFrame
        Returned dataframe will have a single column named "count" which contains the value_counts()
        for each unique value of df[col]. The index name of this dataframe is `col`.

    Example
    -------
    >>> value_counts_df(pd.DataFrame({'a': [1, 1, 2, 2, 2]}), 'a')
       count
    a
    2      3
    1      2
    """
    df = pd.DataFrame(df[col].value_counts())
    df.index.name = col
    df.columns = ['count']
    return df

val_count_single = value_counts_df(new_df, selected_col)
If you want to apply it to all object columns in the dataframe:
def valueCountDF(df, object_cols):
    c = df[object_cols].apply(lambda x: x.value_counts(dropna=False)).T.stack().astype(int)
    p = (df[object_cols].apply(lambda x: x.value_counts(normalize=True,
                                                        dropna=False)).T.stack() * 100).round(2)
    cp = pd.concat([c, p], axis=1, keys=["Count", "Percentage %"])
    return cp

val_count_df_cols = valueCountDF(df, selected_columns)
And finally, you can use st.table or st.dataframe to show the dataframe in your Streamlit app.
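A minimal sketch of wiring this into the snippet from the question (st.selectbox is used here to pick one column at run time; value_counts_df is the helper defined above):
import streamlit as st

if st.checkbox("Select Columns To Show"):
    all_columns = df.columns.tolist()
    selected_columns = st.multiselect("Select", all_columns)
    new_df = df[selected_columns]
    st.dataframe(new_df)

    if selected_columns:
        # pick one of the chosen columns and show its value_counts as a table
        selected_col = st.selectbox("Column for value_counts", selected_columns)
        st.table(value_counts_df(new_df, selected_col))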

Loop to get the dataframe, and pass it to a function

I am trying to:
read files and store them in dataframes
store the names of the dataframes in a dataframe
loop to recover each dataframe and pass it to a function as a dataframe
It doesn't work because when I retrieve the name of the dataframe, it is a str object, not a dataframe, so the calculation fails.
df_files:
                 dataframe                  name
0                df_bureau                bureau
1  df_previous_application  previous_application
Code:
def missing_values_table_for(df_for, name):
    mis_val_for = df_for.isnull().sum()  # count null values -> error here

for index, row in df_files.iterrows():
    missing_values_for = missing_values_table_for(dataframe, name)
Thanks in advance.
I believe the best approach here is working with a dictionary of DataFrames, created by looping over the file names from glob:
import glob
import pandas as pd

files = glob.glob('files/*.csv')
dfs = {f: pd.read_csv(f) for f in files}

for k, v in dfs.items():
    df = v.isnull().sum()
    print(df)
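A short sketch of feeding that dictionary back into the question's own helper (assuming missing_values_table_for is the function from the question, taking a DataFrame and a name):
results = {}
for name, frame in dfs.items():
    # frame is the actual DataFrame here, not its name, so isnull() works inside the helper
    results[name] = missing_values_table_for(frame, name)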

Merging Dataframe within a for loop

I tried to run my self-created function in a for loop, but it does not work as expected.
Some remarks in advance:
ma_strategy is my function and requires three inputs
ticker_list is a list of strings
result is a pandas DataFrame with 7 columns, and I can call the column 'return_cum' with result['return_cum']. The rows of this column contain floating point numbers.
These for loops don't work:
for i in ticker_list:
    result = ma_strategy(i, 20, 5)
    x = result['return_cum']
    sample_returns = pd.DataFrame
    y = pd.merge(x.to_frame(), sample_returns, left_index=True)

for i in ticker_list:
    result = ma_strategy(i, 20, 5)
    x = result[['return_cum']]
    sample_returns = pd.DataFrame
    y = pd.concat([sample_returns, x], axis=1)
My intention is the following:
The for loop should iterate over the items in my ticker_list and save the 'return_cum' column in x. Then the 'return_cum' columns should be collected together in y, so that at the end I get a DataFrame with all the 'return_cum' columns of my ticker list.
How can I achieve that goal? I tried pd.concat and merge, but nothing works.
Thanks for your help!
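A minimal sketch of the usual pattern, collecting the columns in a list inside the loop and concatenating once afterwards (assuming ma_strategy and ticker_list behave as described above):
import pandas as pd

frames = []
for ticker in ticker_list:
    result = ma_strategy(ticker, 20, 5)
    # keep the 'return_cum' column and label it with its ticker
    frames.append(result['return_cum'].rename(ticker))

# one concat after the loop instead of merging inside it
sample_returns = pd.concat(frames, axis=1)
print(sample_returns.head())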

Duplicate row in PySpark Dataframe based off value in another column

I have a dataframe that looks like the following:
ID NumRecords
123 2
456 1
789 3
I want to create a new dataframe that concatenates the two columns and duplicates the rows based on the value in NumRecords.
So the output should be:
ID_New 123-1
ID_New 123-2
ID_New 456-1
ID_New 789-1
ID_New 789-2
ID_New 789-3
I was looking into the "explode" function but it seemed to take only a constant based on the example I saw.
I had a similar issue; this code will duplicate the rows based on the value in the NumRecords column:
from pyspark.sql import Row

def duplicate_function(row):
    data = []  # list of rows to return
    to_duplicate = float(row["NumRecords"])

    i = 0
    while i < to_duplicate:
        row_dict = row.asDict()      # convert a Spark Row object to a Python dictionary
        row_dict["SERIAL_NO"] = str(i)
        new_row = Row(**row_dict)    # create a Spark Row object based on a Python dictionary
        data.append(new_row)         # adds this Row to the list
        i += 1

    return data  # returns the final list

# create final dataset based on value in NumRecords column
df_flatmap = df_input.rdd.flatMap(duplicate_function).toDF(df_input.schema)
You can use a udf:
from pyspark.sql.functions import udf, explode, concat_ws
from pyspark.sql.types import ArrayType, StringType

range_ = udf(lambda x: [str(y) for y in range(1, x + 1)], ArrayType(StringType()))

df = (df.withColumn("records", range_("NumRecords"))
        .withColumn("record", explode("records"))
        .withColumn("ID_New", concat_ws("-", "id", "record")))