How to consolidate series data and make a new dataframe in pandas?

I've got a dataframe like this:
[original data]
and I'd like to end up with a new dataframe like the one below:
[new data]
How can I write code for this transformation?
It needs to consolidate the first series of data and create a new dataframe.

Some imports:
import pandas as pd
import numpy as np
Here we create a dataframe from the data you provided:
df = pd.DataFrame({
    "a": [
        'A2C02158300', 'D REC/BAS16-03W 100V 250mA SOD323 0s SMD', 'D201,D206,D218,D219,D222,D302,D308,D408', 'D409,D501,D502,D505,D506,D507,D508',
        'A2C02250500', 'T BIP/PUMD3,SOT363,SMD SOLDERING', 'T209,T501,T502'
    ]
})
df.head(10)
Output:
Then we prepare a new dataframe, out, with the first two columns. Note that we keep the original df intact so the next step can still slice its remaining rows:
s1 = df.iloc[::4].reset_index(drop=True)
s2 = df.iloc[1::4].reset_index(drop=True)
out = pd.DataFrame({
    'a': s1['a'],
    'b': s2['a']
})
After that, prepare the third column from the original df by splitting the comma-separated values into separate rows, and join it on:
s3 = df.iloc[2::4].reset_index(drop=True)
s3 = s3['a'].str.split(',', expand=True).stack()
s3.index = s3.index.droplevel(-1)
s3.name = 'c'
out = out.join(s3)
out.reset_index(drop=True, inplace=True)
out
Output:
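As a side note, on pandas 0.25 or newer the split/stack/droplevel step can be written more concisely with Series.explode. A minimal sketch of the same third-column step:
s3 = (df.iloc[2::4]['a']
        .reset_index(drop=True)
        .str.split(',')          # comma-separated string -> list per row
        .explode()               # one row per list element, index repeated
        .rename('c'))
out = out.join(s3).reset_index(drop=True)
The join repeats each (a, b) pair once per designator, exactly as the stack-based version does.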

Related

Create multiple DataFrames using data from an API

I'm using the World Bank API to analyze data, and I want to create multiple data frames with the same indicators for different countries.
import wbgapi as wb
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time as t_lib
#Variables
indicators = ['AG.PRD.LVSK.XD', 'AG.YLD.CREL.KG', 'NY.GDP.MINR.RT.ZS', 'SP.POP.TOTL.FE.ZS']
countries = ['PRT', 'BRL', 'ARG']
time = range(1995, 2021)
#Code
def create_df(country):
    df = wb.data.DataFrame(indicators, country, time, labels=True).reset_index()
    columns = [item for item in df['Series']]
    df = df.T
    df.columns = columns
    df.drop(['Series', 'series'], axis=0, inplace=True)
    df = df.reset_index()
    return df

list_of_dfs = []
for n in range(len(countries)):
    var = create_df(countries[n])
    list_of_dfs.append(var)
What I really wanted is to create a data frame with a different name for each country and to store them in a list or dict like: [df_1, df_2, df_3...]
EDIT:
I'm trying this now:
a_dictionary = {}
for n in range(len(countries)):
    a_dictionary["key%s" % n] = create_df(countries[n])
It was supposed to work, but I still get the same error on the second loop iteration:
APIResponseError: APIError: JSON decoding error (https://api.worldbank.org/v2/en/sources/2/series/AG.PRD.LVSK.XD;AG.YLD.CREL.KG;NY.GDP.MINR.RT.ZS;SP.POP.TOTL.FE.ZS/country/BRL/time/YR1995;YR1996;YR1997;YR1998;YR1999;YR2000;YR2001;YR2002;YR2003;YR2004;YR2005;YR2006;YR2007;YR2008;YR2009;YR2010;YR2011;YR2012;YR2013;YR2014;YR2015;YR2016;YR2017;YR2018;YR2019;YR2020?per_page=1000&page=1&format=json)
UPDATE:
Thanks to notiv I noticed the problem was the country code: it should be "BRA", not "BRL".
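With that fix, the variable from the question becomes:
countries = ['PRT', 'BRA', 'ARG']  # 'BRA' is the ISO3 country code for Brazil; 'BRL' is the currency code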
I'm also putting here a new approach that works as well by creating a master dataframe and then slicing it by country to create the desired dataframes:
df = wb.data.DataFrame(indicators, countries, time, labels=True).reset_index()
columns = [item for item in df['Series']]
columns
df = df.T
df.columns = columns
df.drop(['Series', 'series'], axis=0, inplace=True)
df = df.reset_index()
df
a_dictionary = {}
for n in range(len(countries)):
    new_df = df.loc[:, (df == countries[n]).any()]
    new_df['index'] = df['index']
    new_df.set_index('index', inplace=True)
    new_df.drop(['economy', 'Country'], axis=0, inplace=True)
    a_dictionary["eco_df%s" % n] = new_df

for loop in range(len(countries)):
    for n in range(len(a_dictionary[f'eco_df{loop}'].columns)):
        sns.set_theme(style="dark")
        g = sns.relplot(data=a_dictionary[f'eco_df{loop}'], x=a_dictionary[f'eco_df{loop}'].index,
                        y=a_dictionary[f'eco_df{loop}'].iloc[:, n], kind="line", palette="crest",
                        height=5, aspect=1.61, legend=False).set(title=countries[loop])
        g.set_axis_labels("Years")
        g.set_xticklabels(rotation=45)
        g.tight_layout()
In the end I used the dataframes to create a chart for each indicator for each country.
Many thanks for the help

(Python) Dataframe copy with selected columns and index

The whole dataframe can be copied to df2 as shown below.
How can I copy only the 'B' column and the index of df into df2?
import pandas as pd
df = pd.DataFrame({'A': [10, 20, 30],'B': [100, 200, 300]}, index=['2021-11-24', '2021-11-25', '2021-11-26'])
df2 = df.copy()
You can simply select and then copy as follows:
df2 = df[['B']].copy()
I am using a list as the selection in order to have a DataFrame instead of a pd.Series.
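To make the Series/DataFrame distinction concrete, here is a quick check (the printed results are shown as comments):
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [100, 200, 300]},
                  index=['2021-11-24', '2021-11-25', '2021-11-26'])

print(type(df['B']))    # <class 'pandas.core.series.Series'>
print(type(df[['B']]))  # <class 'pandas.core.frame.DataFrame'>

df2 = df[['B']].copy()             # only column 'B', index preserved
print(df2.index.equals(df.index))  # True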

How to fill NaN's in a pandas dataframe with random 1's and 0's

How to replace the NaN values in a pandas dataframe with random 0's and 1's?
df.fillna(random.randint(0,1))
seems to fill the NaN's in certain columns with all 1's or all 0's
# Creating a dummy dataframe
import pandas as pd
import numpy as np
cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [100, np.nan, 27000, np.nan]}
df = pd.DataFrame(cars, columns=['Brand', 'Price'])
# Replacing NaN in a particular column
a = df.Price.isnull()
rand_int = np.random.randint(2, size=a.sum())
df.loc[a, 'Price'] = rand_int
print(df)
# For the entire dataframe
for i in df:
    a = df[i].isnull()
    rand_int = np.random.randint(2, size=a.sum())
    df.loc[a, i] = rand_int
print(df)
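For context, df.fillna(random.randint(0, 1)) fills every NaN with the same value because random.randint(0, 1) is evaluated once, before fillna is called, so fillna only ever sees a single scalar. A vectorized alternative, sketched here under the assumption that every column with NaNs should receive random 0/1 fills, is to pass fillna a DataFrame of random integers aligned to the original:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
                   'Price': [100, np.nan, 27000, np.nan]})

# Random 0/1 values with the same shape, index and columns as df;
# fillna only takes a value from this frame where df has a NaN.
fill = pd.DataFrame(np.random.randint(0, 2, size=df.shape),
                    index=df.index, columns=df.columns)
df = df.fillna(fill)
print(df)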

How do I block a KeyError in Python from reoccurring or create an exception to handle it?

I'm new to Python and working with an API.
My code is below:
import pandas as pd
import json
from pandas.io.json import json_normalize
import datetime
threedaysago = datetime.date.fromordinal(datetime.date.today().toordinal()-3).strftime("%F")
import http.client
conn = http.client.HTTPSConnection("api.sendgrid.com")
payload = "{}"
keys = {
    # "CF" : "SG.UdhzjmjYR**.-",
}
df = []  # Create new Dataframe
for name, value in keys.items():
    headers = {'authorization': "Bearer " + value}
    conn.request("GET", "/v3/categories/stats/sums?aggregated_by=&start_date={d}&end_date={d}".format(d=threedaysago), payload, headers)
    res = conn.getresponse()
    data = res.read()
    print(data.decode("utf-8"))
    d = json.loads(data.decode("utf-8"))
    c = d['stats']
    # row = d['stats'][0]['name']
    # Add Brand to data row here with 'name'
    df.append(c)  # Load data row into df

#1
df = pd.DataFrame(df[0])
df_new = df[['name']]
df_new.rename(columns={'name': 'Category'}, inplace=True)
df_metric = pd.DataFrame(list(df['metrics'].values))
sendgrid = pd.concat([df_new, df_metric], axis=1, sort=False)
sendgrid.set_index('Category', inplace=True)
sendgrid.insert(0, 'Date', threedaysago)
sendgrid.insert(1, 'BrandId', 99)
sendgrid.rename(columns={
    'blocks': 'Blocks',
    'bounce_drops': 'BounceDrops',
    'bounces': 'Bounces',
    'clicks': 'Clicks',
    'deferred': 'Deferred',
    'delivered': 'Delivered',
    'invalid_emails': 'InvalidEmails',
    'opens': 'Opens',
    'processed': 'Processed',
    'requests': 'Requests',
    'spam_report_drops': 'SpamReportDrops',
    'spam_reports': 'SpamReports',
    'unique_clicks': 'UniqueClicks',
    'unique_opens': 'UniqueOpens',
    'unsubscribe_drops': 'UnsubscribeDrops',
    'unsubscribes': 'Unsubscribes'
}, inplace=True)
When I run this however, I receive an error:
KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]"
I know the reason this happens is that there are no stats available for three days ago:
{"date":"2020-02-16","stats":[]}
But how do I handle this exception in my code? This will run as a daily report, and it will break if the error is not handled.
Sorry for the late answer.
KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]" means there is no column called name in your dataframe.
But, you believe that error occurred because of "stats" : []. It is also not true. If any of the indexes is empty the error should occur as ValueError: arrays must all be same length
I have recreated this problem and I will show you to get an idea to overcome this problem.
Recreating KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]"
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c']}]
df = pd.DataFrame(df[0])
df = df[['D']]
print(df)
Output:
KeyError: "None of [Index(['D'], dtype='object')] are in the [columns]"
Solution: there is no column called 'D' in the data frame, so recheck your columns.
Now add 'D' and see what happens:
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c'], 'D': []}]
df = pd.DataFrame(df[0])
df = df[['D']]
print(df)
Output:
ValueError: arrays must all be same length
Solution: column 'D' needs to contain the same number of items as 'A', 'B', and 'C'.
To overcome both problems:
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c'], 'D':[]}]
df = pd.DataFrame.from_dict(df[0], orient='index')  # shorter values are padded with None
df.transpose()  # note: transpose() returns a new frame; df itself stays keyed by row
print(df)
Output:
      0     1     2
A     1     4     5
B     4     5     6
C     a     b     c
D  None  None  None
You can see that the columns are now represented as rows. You can use loc to select each of these rows:
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c'], 'D':[]}]
df = pd.DataFrame.from_dict(df[0], orient='index')
df.transpose()
df = df.loc[['A']] # uses loc
print(df)
Output:
   0  1  2
A  1  4  5
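Coming back to the daily report from the question: one way to keep it from breaking on days with empty stats is simply to skip those responses before building the dataframe. A minimal sketch, reusing keys, conn, payload and threedaysago exactly as defined in the question:
import json
import pandas as pd

rows = []
for name, value in keys.items():
    headers = {'authorization': "Bearer " + value}
    conn.request("GET",
                 "/v3/categories/stats/sums?aggregated_by=&start_date={d}&end_date={d}".format(d=threedaysago),
                 payload, headers)
    d = json.loads(conn.getresponse().read().decode("utf-8"))
    if not d.get('stats'):  # empty or missing stats -> nothing to report for this key
        print("No stats for {} on {}, skipping".format(name, threedaysago))
        continue
    rows.extend(d['stats'])

if rows:
    df = pd.DataFrame(rows)
    # ... continue with the rename/concat steps from the question
else:
    print("No data returned; skipping today's report")
This sketch collects the stats from all keys into one frame; adapt it if you build one frame per key as in the question.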

loop through data frame

I want to change the column order of my data frames using a for loop, but it doesn't work. My code is as follows:
import pandas as pd
df1 = pd.DataFrame({'a': 1, 'b': 2}, index=[1])
df2 = pd.DataFrame({'c': 3, 'c': 4}, index=[1])
for df in [df1, df2]:
    df = df.loc[:, df.columns.tolist()[::-1]]
Then the order of columns of df1 and df2 is not changed.
You can make use of chained assignment with a list comprehension, i.e.
df1, df2 = [i.loc[:, i.columns[::-1]] for i in [df1, df2]]
print(df1)
   b  a
1  2  1
print(df2)
   c
1  4
Note: In my answer I am trying to build up to show that using a dictionary to store the dataframes is the best way for the general case. If you are looking to mutate the original dataframe variables, @Bharath's answer is the way to go.
Answer:
The code doesn't work because you are not assigning back to the list of dataframes. Here's how to fix that:
import pandas as pd
df1 = pd.DataFrame({'a': 1, 'b': 2}, index=[1])
df2 = pd.DataFrame({'c': 3, 'c': 4}, index=[1])
l = [df1, df2]
for i, df in enumerate(l):
    l[i] = df.loc[:, df.columns.tolist()[::-1]]
The difference is that I iterate with enumerate to get both the dataframe and its position in the list, and then assign the changed dataframe back to that position.
Execution details:
Before applying the change:
In [28]: for i in l:
    ...:     print(i.head())
    ...:
   a  b
1  1  2
   c
1  4
In [29]: for i, df in enumerate(l):
    ...:     l[i] = df.loc[:, df.columns.tolist()[::-1]]
    ...:
After applying the change:
In [30]: for i in l:
    ...:     print(i.head())
    ...:
   b  a
1  2  1
   c
1  4
Improvement proposal:
It's better to use a dictionary as follows:
import pandas as pd
d = {}
d['df1'] = pd.DataFrame({'a': 1, 'b': 2}, index=[1])
d['df2'] = pd.DataFrame({'c': 3, 'c': 4}, index=[1])
for i, df in d.items():
    d[i] = df.loc[:, df.columns.tolist()[::-1]]
Then you will be able to reference your dataframes from the dictionary, for instance d['df1'].
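For example:
print(d['df1'])  # columns are now ['b', 'a']
print(d['df2'])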
You can reverse columns and values:
import pandas as pd
df1 = pd.DataFrame({'a': 1, 'b': 2}, index=[1])
df2 = pd.DataFrame({'c': 3, 'c': 4}, index=[1])
print('before')
print(df1)
for df in [df1, df2]:
    df.values[:, :] = df.values[:, ::-1]
    df.columns = df.columns[::-1]
print('after')
print(df1)
df1
Output:
before
   a  b
1  1  2
after
   b  a
1  2  1