Loop to get the dataframe, and pass it to a function - pandas

I am trying to:
- read files and store them in dataframes
- store the names of the dataframes in a dataframe
- loop to recover each dataframe and pass it to a function as a dataframe
It doesn't work because when I retrieve the name of the dataframe, it is a str object, not a dataframe, so the calculation fails.
df_files:
                 dataframe                  name
0                df_bureau                bureau
1  df_previous_application  previous_application
Code:
def missing_values_table_for(df_for, name):
    mis_val_for = df_for.isnull().sum()  # count null values -> error here

for index, row in df_files.iterrows():
    # row['dataframe'] is the str 'df_bureau', not the DataFrame itself
    missing_values_for = missing_values_table_for(row['dataframe'], row['name'])
Thanks in advance.

I believe the best approach here is to work with a dictionary of DataFrames, with keys created by looping over the file names returned by glob:
import glob
import pandas as pd

files = glob.glob('files/*.csv')
# keys are the file paths, values are the DataFrames
dfs = {f: pd.read_csv(f) for f in files}

for k, v in dfs.items():
    df = v.isnull().sum()  # count null values per column
    print(df)
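Each DataFrame can then be retrieved directly by its key. If you prefer shorter keys than full paths, here is a small sketch (assuming a file such as files/bureau.csv exists) that re-keys the dictionary by base file name:
import os

# hypothetical: re-key by base file name without the extension
dfs = {os.path.splitext(os.path.basename(f))[0]: pd.read_csv(f) for f in files}
print(dfs['bureau'].isnull().sum())  # access one DataFrame directly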


Read json files in pandas dataframe

I have a large pandas dataframe (17,000 rows) with a file path in each row pointing to a specific json file. For each row I want to read the json file content and extract it into a new dataframe.
The dataframe looks something like this:
0 /home/user/processed/config1.json
1 /home/user/processed/config2.json
2 /home/user/processed/config3.json
3 /home/user/processed/config4.json
4 /home/user/processed/config5.json
... ...
16995 /home/user/processed/config16995.json
16996 /home/user/processed/config16996.json
16997 /home/user/processed/config16997.json
16998 /home/user/processed/config16998.json
16999 /home/user/processed/config16999.json
What is the most efficient way to do this?
I believe a simple for-loop might be best suited here:
import json
import pandas as pd

json_content = []
# iterate over the path column; iterating the DataFrame itself would only
# yield the column names
for path in df.iloc[:, 0]:
    with open(path) as file:
        json_content.append(json.load(file))

result = pd.DataFrame(json_content)
Generally, I'd try the iterrows() function (as a first attempt at improving efficiency).
The implementation could look like this:
import json
import pandas as pd

json_content = []
for _, row in df.iterrows():
    with open(row.iloc[0]) as file:  # row.iloc[0] is the file path in the first column
        json_content.append(json.load(file))

result = pd.Series(json_content)
A possible solution is the following:
# pip install pandas
import pandas as pd

# convert the column with paths to a list (all rows of the first column)
paths = df.iloc[:, 0].tolist()

all_dfs = []
for path in paths:
    df = pd.read_json(path, encoding='utf-8')
    all_dfs.append(df)
Each df in all_dfs can be accessed individually, or in a loop, by index: all_dfs[0], all_dfs[1], etc.
If you wish, you can merge all_dfs into a single dataframe:
dfs = pd.concat(all_dfs, axis=1)  # axis=1 places the frames side by side; the default axis=0 stacks rows

Getting same value from list in dataframe column using Python

I have a dataframe with 3 columns. Now I added one more column, in which I am adding unique values generated with the random module.
I created a list variable and, using a for loop, I am adding random strings to that list.
After that, I created another loop in which I extract each value from the list and add it to the new column.
But the same value is added to every row each time.
import random
import string
import pandas as pd

df = pd.read_csv("test.csv")
lst = []
for i in range(20):
    randColumn = ''.join(random.choice(string.ascii_uppercase + string.digits)
                         for i in range(20))
    lst.append(randColumn)
for j in lst:
    df['randColumn'] = j
print(df)
#Output.......
   A  B  C            randColumn
0  1  2  3  WHI11NJBNI8BOTMA9RKA
1  4  5  6  WHI11NJBNI8BOTMA9RKA
Could you please help me fix this? Why does each row have the same value from the list?
Updated to work correctly with any type of column in df.
If I got your question right, you can use the zip method of the rdd to achieve your goal.
import random
import string
from pyspark.sql import SparkSession, Row

lst = []
for i in range(2):
    rand_column = ''.join(random.choice(string.ascii_uppercase + string.digits) for i in range(20))
    # adding the random strings as Rows to the list
    lst.append(Row(random=rand_column))

# making an rdd from the array of random strings
random_rdd = sparkSession.sparkContext.parallelize(lst)
# zip row-wise with the original dataframe's rdd and merge each pair of Rows
res = df.rdd.zip(random_rdd).map(lambda rows: Row(**(rows[0].asDict()), **(rows[1].asDict()))).toDF()
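As a side note, the reason every row ends up identical in the original pandas code is that df['randColumn'] = j broadcasts the scalar j to the entire column on each pass, so only the last string survives. A minimal pandas-only sketch of a fix (assuming one random string per row is wanted):
import random
import string
import pandas as pd

df = pd.read_csv("test.csv")
# one unique 20-character string per row, assigned in a single vectorized step
df['randColumn'] = [''.join(random.choices(string.ascii_uppercase + string.digits, k=20))
                    for _ in range(len(df))]
print(df)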

Restructuring Dataframes (stored in dictionary)

I have football data stored in a dictionary, keyed by league, for different seasons. So, for example, I have the results of one league for the seasons 2017-2020 in one dataframe stored in the dictionary. Now I need to create new dataframes by season, so that I have all the results from 2019 in one dataframe. What is the best way to do this?
Thank you!
Assuming you are using open football as your source:
- use the GitHub API to get all files in the repo
- a function to normalize the JSON
- it is then simple to generate either a concatenated DF or a dict of all the results
import requests
import pandas as pd

# normalize football scores data into a dataframe
def structuredf(res):
    js = res.json()
    if "rounds" not in js:
        return (pd.json_normalize(js["matches"])
                .pipe(lambda d: d.join(d["score.ft"].apply(pd.Series)
                                       .rename(columns={0: "home", 1: "away"})))
                .drop(columns="score.ft")
                .rename(columns={"round": "name"})
                .assign(seasonname=js["name"], url=res.url))
    df = (pd.json_normalize(pd.json_normalize(js["rounds"])
                            .explode("matches").to_dict("records"))
          .assign(seasonname=js["name"], url=res.url)
          .pipe(lambda d: d.join(d["matches.score.ft"].apply(pd.Series)
                                 .rename(columns={0: "home", 1: "away"})))
          .drop(columns="matches.score.ft")
          .pipe(lambda d: d.rename(columns={c: c.split(".")[-1] for c in d.columns})))
    return df
# get a listing of all the data files we're interested in
res = requests.get("https://api.github.com/repos/openfootball/football.json/git/trees/master?recursive=1")
dfm = pd.DataFrame(res.json()["tree"])

# concat into one dataframe
df = pd.concat([structuredf(res)
                for _, p in dfm.loc[dfm["path"].str.contains(r".en.[0-9]+.json"), "path"].items()
                for res in [requests.get(f"https://raw.githubusercontent.com/openfootball/football.json/master/{p}")]])

# dictionary of dataframes, keyed by season name
d = {res.json()["name"]: structuredf(res)
     for _, p in dfm.loc[dfm["path"].str.contains(r".en.[0-9]+.json"), "path"].items()
     for res in [requests.get(f"https://raw.githubusercontent.com/openfootball/football.json/master/{p}")]}
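If the multi-season dataframe already exists, as in the question, splitting it by season is a dict comprehension over groupby. A small sketch, assuming a seasonname column like the one structuredf() assigns:
# one dataframe per season, keyed by season name
by_season = {season: g.reset_index(drop=True) for season, g in df.groupby("seasonname")}
# inspect the available keys, then pick a season
print(by_season.keys())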

Streamlit - Applying value_counts / groupby to column selected on run time

I am trying to apply the value_counts method to a dataframe based on the columns selected dynamically in the Streamlit app.
This is what I am trying to do:
if st.checkbox("Select Columns To Show"):
    all_columns = df.columns.tolist()
    selected_columns = st.multiselect("Select", all_columns)
    new_df = df[selected_columns]
    st.dataframe(new_df)
The above lets me select columns and displays data for the selected columns. I am trying to see how I could apply the value_counts/groupby method to this output in the Streamlit app.
If I try to do the below
st.table(new_df.value_counts())
I get the below error
AttributeError: 'DataFrame' object has no attribute 'value_counts'
I believe the issue lies in passing a list of columns to the dataframe. When you pass a single column name in [] to a dataframe, you get back a pandas.Series object (which has the value_counts method). But when you pass a list of columns, you get back a pandas.DataFrame, which in older pandas versions (before 1.1.0) doesn't have a value_counts method.
Can you try st.table(new_df[col_name].value_counts()) for a single column name col_name?
I think the error occurs because value_counts() is applicable to a Series, not a dataframe.
You can try converting the .value_counts() output to a dataframe.
If you want to apply it to one single column:
def value_counts_df(df, col):
    """
    Returns pd.value_counts() as a DataFrame

    Parameters
    ----------
    df : Pandas Dataframe
        Dataframe on which to run value_counts(), must have column `col`.
    col : str
        Name of column in `df` for which to generate counts

    Returns
    -------
    Pandas Dataframe
        Returned dataframe will have a single column named "count" which
        contains the value_counts() for each unique value of df[col]. The
        index name of this dataframe is `col`.

    Example
    -------
    >>> value_counts_df(pd.DataFrame({'a': [1, 1, 2, 2, 2]}), 'a')
       count
    a
    2      3
    1      2
    """
    df = pd.DataFrame(df[col].value_counts())
    df.index.name = col
    df.columns = ['count']
    return df

# selected_col is a single column name chosen in the app
val_count_single = value_counts_df(new_df, selected_col)
If you want to apply it to all object columns in the dataframe:
def valueCountDF(df, object_cols):
    c = df[object_cols].apply(lambda x: x.value_counts(dropna=False)).T.stack().astype(int)
    p = (df[object_cols].apply(lambda x: x.value_counts(normalize=True,
                                                        dropna=False)).T.stack() * 100).round(2)
    cp = pd.concat([c, p], axis=1, keys=["Count", "Percentage %"])
    return cp

val_count_df_cols = valueCountDF(df, selected_columns)
And finally, you can use st.table or st.dataframe to show the dataframe in your Streamlit app.
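For example, a small sketch reusing the names from above:
st.table(val_count_single)        # single-column counts
st.dataframe(val_count_df_cols)   # counts and percentages for all selected columns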

Append values to pandas dataframe incrementally inside for loop

I am trying to add rows to a pandas dataframe incrementally inside a for loop.
My for loop is like below:
def print_values(cc):
    data = []
    for x in values[cc]:
        data.append(labels[x])
    # cc is a constant and data is a list. I need these values to be appended
    # as a row in a pandas dataframe.
    # The dataframe structure is: df = pd.DataFrame(columns=['Index', 'Names'])
    print(cc)
    print(data)
    # This does not work - not sure what the problem is!
    # df_clustercontents.loc['Cluster_Index'] = cc
    # df_clustercontents.loc['DatabaseNames'] = data

for x in range(0, 10):
    print_values(x)
I need the values "cc" and "data" to be appended to the dataframe incrementally.
Any help would be really appreciated !!
You can use:
...
print(cc)
print(data)
df_clustercontents.loc[len(df_clustercontents)] = [cc, data]
...
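Put together, a minimal sketch of the whole loop (assuming values and labels from the question exist):
import pandas as pd

df_clustercontents = pd.DataFrame(columns=['Index', 'Names'])

def print_values(cc):
    data = [labels[x] for x in values[cc]]
    # len(df) is the next free positional label, so this appends one row
    df_clustercontents.loc[len(df_clustercontents)] = [cc, data]

for x in range(0, 10):
    print_values(x)

print(df_clustercontents)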