Convert JSON array column to list of dicts in PySpark - pandas

I created a pandas df from a pyspark df in the following way:
pd_df = (
    df
    .withColumn('city_list', F.struct(F.col('n_res'), F.col('city')))
    .groupBy(['user_ip', 'created_month'])
    .agg(
        F.to_json(
            F.sort_array(
                F.collect_list(F.col('city_list')), asc=False
            )
        ).alias('city_list')
    )
).toPandas()
However, the city_list column of my pd_df was not converted into a list of dicts but into a string.
Here is an example of a column value: "[{'n_res': 40653, 'city': 00005}, {'n_res': 12498, 'city': 00008}]".
A solution could be to call pd_df.city_list.apply(eval), but I wouldn't consider that a safe approach.
How can I create a list of dict directly from pyspark?
Thanks.
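For what it's worth, a safer alternative to eval is json.loads, assuming the column really contains valid JSON (Spark's to_json emits double-quoted keys, and city would need to be a string to keep its leading zeros). A minimal sketch on stand-in data:

```python
import json

import pandas as pd

# Stand-in for the result of toPandas(); the value mimics what
# to_json would actually emit (double quotes, city as a string).
pd_df = pd.DataFrame({
    "city_list": [
        '[{"n_res": 40653, "city": "00005"}, {"n_res": 12498, "city": "00008"}]'
    ]
})

# Parse each JSON string into a list of dicts without eval
pd_df["city_list"] = pd_df["city_list"].apply(json.loads)
print(pd_df["city_list"].iloc[0][0])  # {'n_res': 40653, 'city': '00005'}
```

Alternatively, dropping the to_json call entirely makes toPandas() return the collected structs as Python objects (lists of Row objects), which can then be converted with a row.asDict() pass.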

Related

Convert multiple downloaded time series share to pandas dataframe

I downloaded the information about multiple shares using the nsepy library for the last 10 days, but could not save it in a pandas dataframe.
Below is the code to download the data for the multiple shares:
import datetime
from datetime import date
from nsepy import get_history
import pandas as pd
symbol = ['SBIN', 'GAIL', 'NATIONALUM']
data = {}
for s in symbol:
    data[s] = get_history(s, start=date(2022, 11, 29), end=date(2022, 12, 12))
Below is the code I am using to convert the data to a pandas dataframe, but I am getting an error:
new = pd.DataFrame(data, index=[0])
new
error message:
ValueError: Shape of passed values is (14, 3), indices imply (1, 3)
The documentation of get_history says:
Returns:
pandas.DataFrame : A pandas dataframe object
Thus, data is a dict with the symbols as keys and pd.DataFrames as values. You are then trying to insert a DataFrame inside another DataFrame, which does not work. If you want to create a new MultiIndex DataFrame from the 3 existing DataFrames, you can do something like this:
result = {}
for symbol, df in data.items():
    columns = df.to_dict()  # avoid shadowing the outer `data` dict
    for key, value in columns.items():
        result[(symbol, key)] = value
df_multi = pd.DataFrame(result)
df_multi.columns
Result (showing just two columns per symbol to clarify the MultiIndex structure):
MultiIndex([(      'SBIN', 'Symbol'),
            (      'SBIN', 'Series'),
            (      'GAIL', 'Symbol'),
            (      'GAIL', 'Series'),
            ('NATIONALUM', 'Symbol'),
            ('NATIONALUM', 'Series')])
Edit
So if you just want a single-index DataFrame, like in your attached file with the symbols in a column, you can simply do this:
new_df = pd.DataFrame()
for symbol in data:
    # sequentially concat the DataFrames from your dict of DataFrames
    new_df = pd.concat([data[symbol], new_df], axis=0)
new_df
Then the output looks like in your file.
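As a sketch, pd.concat can also take the dict of DataFrames directly, which builds the symbol level of the row index (or, with axis=1, the column MultiIndex) in one call; the frames below are stand-ins for the downloaded data:

```python
import pandas as pd

# Stand-in frames; in the question these come from get_history()
data = {
    "SBIN": pd.DataFrame({"Close": [600.0, 605.5]}),
    "GAIL": pd.DataFrame({"Close": [92.1, 93.4]}),
}

# Row-wise concat: the dict keys become an extra index level
stacked = pd.concat(data)
print(stacked.index.get_level_values(0).unique().tolist())  # ['SBIN', 'GAIL']

# Column-wise concat: the dict keys become the top column level,
# like df_multi in the answer above
wide = pd.concat(data, axis=1)
print(wide.columns.tolist())  # [('SBIN', 'Close'), ('GAIL', 'Close')]
```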

pandas DataFrame remove Index from columns

I have a DataFrame such that when I execute:
df.columns
I get
Index(['a', 'b', 'c'])
I need to remove Index to have columns as list of strings, and was trying:
df.columns = df.columns.tolist()
but this doesn't remove Index.
tolist() should be able to convert the Index object to a list:
df1 = df.columns.tolist()
print(df1)
or use values to convert it to an array:
df1 = df.columns.values
The columns attribute of a pandas dataframe always returns an Index object. Assigning the list back to df.columns (as in your original code df.columns = df.columns.tolist()) does not help, because pandas converts the list straight back into an Index; but you can assign the list to a separate variable.
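A minimal illustration of the point: tolist() gives you a plain list, but df.columns itself stays an Index even after you assign a list to it:

```python
import pandas as pd

df = pd.DataFrame(columns=["a", "b", "c"])

cols = df.columns.tolist()
print(cols)  # ['a', 'b', 'c']

df.columns = cols                 # pandas converts the list back to an Index
print(type(df.columns).__name__)  # Index
```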

Pandas - Creating empty Dataframe dynamically for every item in list

I have a list of a few variable names. I am trying to see if I can have an empty DataFrame created for each of these variable names.
sample_list = ['item_1','item_2','item_3']
I want to create an empty DataFrame for each of these 3 items in the list. The structure would be the same as well: two columns, namely Product_Name and Quantity.
Expected output:
Dataframe 1 : item_1
Dataframe 2 : item_2
Dataframe 3 : item_3
IIUC, I would create a dictionary of dataframes using dictionary comprehension:
dict_of_dfs = {k: pd.DataFrame(columns=['Product_Name', 'Quantity']) for k in sample_list}
Then you can see your dataframes using:
dict_of_dfs['item_1']
dict_of_dfs['item_2']
Here is a working solution for what you have described.
Create an empty dictionary and think of the items in the list as the keys used to access it. The value against each key is an (empty) pandas DataFrame with the two columns you named.
sample_list = ['l1', 'l2', 'l3']
sample_dict = dict()
for index, item in enumerate(sample_list):
    print('Creating an empty dataframe for the item at index {}'.format(index))
    sample_dict[item] = pd.DataFrame(columns=['Product_Name', 'Quantity'])
Check if the dictionary got correctly created:
print(sample_dict)
{'l1': Empty DataFrame
Columns: [Product_Name, Quantity]
Index: [],
'l2': Empty DataFrame
Columns: [Product_Name, Quantity]
Index: [],
'l3': Empty DataFrame
Columns: [Product_Name, Quantity]
Index: []}
And the keys of the dictionary are indeed the items in the list:
print(sample_dict.keys())
dict_keys(['l1', 'l2', 'l3'])
Cheers!
Intuitively I'd create a dict where the keys are the elements in the list and the values are the dataframes:
d = {}
for item in sample_list:
    d[item] = pd.DataFrame()  # df creation
To access the dfs:
d['item_1']...

Streamlit - Applying value_counts / groupby to column selected on run time

I am trying to apply the value_counts method to a DataFrame based on the columns selected dynamically in the Streamlit app.
This is what I am trying to do:
if st.checkbox("Select Columns To Show"):
    all_columns = df.columns.tolist()
    selected_columns = st.multiselect("Select", all_columns)
    new_df = df[selected_columns]
    st.dataframe(new_df)
The above lets me select columns and displays the data for the selected columns. I am trying to see how I could apply the value_counts/groupby method on this output in the Streamlit app.
If I try to do the below
st.table(new_df.value_counts())
I get the below error
AttributeError: 'DataFrame' object has no attribute 'value_counts'
I believe the issue lies in passing a list of columns to a dataframe. When you pass a single column in [] to a dataframe, you get back a pandas.Series object (which has the value_counts method). But when you pass a list of columns, you get back a pandas.DataFrame, which does not have a value_counts method in pandas versions before 1.1 (DataFrame.value_counts was added in pandas 1.1).
Can you try st.table(new_df[col_name].value_counts())
I think the error is because value_counts() is applicable to a Series and not a DataFrame.
You can try converting the .value_counts() output to a dataframe.
If you want to apply it to one single column:
def value_counts_df(df, col):
    """
    Returns pd.value_counts() as a DataFrame

    Parameters
    ----------
    df : Pandas Dataframe
        Dataframe on which to run value_counts(), must have column `col`.
    col : str
        Name of column in `df` for which to generate counts

    Returns
    -------
    Pandas Dataframe
        Returned dataframe will have a single column named "count" which
        contains the value_counts() for each unique value of df[col].
        The index name of this dataframe is `col`.

    Example
    -------
    >>> value_counts_df(pd.DataFrame({'a': [1, 1, 2, 2, 2]}), 'a')
       count
    a
    2      3
    1      2
    """
    df = pd.DataFrame(df[col].value_counts())
    df.index.name = col
    df.columns = ['count']
    return df

val_count_single = value_counts_df(new_df, selected_col)
If you want to apply it to all object columns in the dataframe:
def valueCountDF(df, object_cols):
    c = df[object_cols].apply(lambda x: x.value_counts(dropna=False)).T.stack().astype(int)
    p = (df[object_cols].apply(lambda x: x.value_counts(normalize=True,
                                                        dropna=False)).T.stack() * 100).round(2)
    cp = pd.concat([c, p], axis=1, keys=["Count", "Percentage %"])
    return cp

val_count_df_cols = valueCountDF(df, selected_columns)
Finally, you can use st.table or st.dataframe to show the dataframe in your Streamlit app.
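Note that DataFrame.value_counts was added in pandas 1.1, so on a recent pandas the original new_df.value_counts() call works and counts unique row combinations across the selected columns. A small sketch on stand-in data:

```python
import pandas as pd

# Stand-in for the column subset selected in the Streamlit app
new_df = pd.DataFrame({"col_a": ["x", "x", "y"], "col_b": [1, 1, 2]})

# Counts each unique row combination (requires pandas >= 1.1);
# the result is a Series indexed by the column value tuples
counts = new_df.value_counts()
print(counts[("x", 1)])  # 2
print(counts[("y", 2)])  # 1
```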

concat series onto dataframe with column name

I want to add a Series (s) to a Pandas DataFrame (df) as a new column. The series has more values than there are rows in the dataframe, so I am using the concat method along axis 1.
df = pd.concat((df, s), axis=1)
This works, but the new column of the dataframe representing the series is given an arbitrary numerical column name, and I would like this column to have a specific name instead.
Is there a way to add a series to a dataframe, when the series is longer than the rows of the dataframe, and with a specified column name in the resulting dataframe?
You can try Series.rename:
df = pd.concat((df, s.rename('col')), axis=1)
One option is simply to specify the name when creating the series:
example_scores = pd.Series([1,2,3,4], index=['t1', 't2', 't3', 't4'], name='example_scores')
Using the name attribute when creating the series is all I needed.
Try:
df = pd.concat((df, s.rename('CoolColumnName')), axis=1)
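A runnable sketch of the rename-then-concat approach, with a series longer than the frame (rows missing from the frame get NaN in its columns):

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20]}, index=[0, 1])
s = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])  # longer than df

# concat aligns on the index (outer join), so the result has 3 rows
out = pd.concat((df, s.rename("CoolColumnName")), axis=1)
print(out.columns.tolist())  # ['a', 'CoolColumnName']
print(len(out))              # 3; df's 'a' is NaN at index 2
```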