I used some code that works in VS Code:
df.iat[row_marker, column_marker] = thisText
At first df is not defined, and row_marker and column_marker both equal zero.
It creates the new row and column correctly!
I tried the same code in JupyterLab but got an IndexError. I assume it is because the number of columns is not known.
Why does this work in VS Code but not in JupyterLab?
How can I add new values manually when I don't know the size of the dataframe in advance?
New comment: the dataframe exists but is empty.
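If you have an empty dataframe you can add rows with pd.concat (the older DataFrame.append method was deprecated in pandas 1.4 and removed in pandas 2.0):
import pandas as pd
df = pd.DataFrame()
thisText = 'Hello World'
df = pd.concat([df, pd.DataFrame([{'column': thisText}])], ignore_index=True)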
You could use this pattern to build a dataframe that can grow in both dimensions:
import pandas as pd
site_no = 0
df = pd.DataFrame()
while site_no < 4:  # your website counter
    if site_no < 2:  # website with col0 and col1 data
        row = pd.DataFrame(
            {'col0': 'Hello', 'col1': 'World'}, index=[site_no])
    else:  # website with col2 and col3 data
        row = pd.DataFrame(
            {'col2': 'Say', 'col3': 'Goodbye'}, index=[site_no])
    # pd.concat replaces df.append, which was removed in pandas 2.0
    df = pd.concat([df, row])
    site_no += 1
df
    col0   col1 col2     col3
0  Hello  World  NaN      NaN
1  Hello  World  NaN      NaN
2    NaN    NaN  Say  Goodbye
3    NaN    NaN  Say  Goodbye
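Another option when neither the number of rows nor the number of columns is known up front is to collect the cells in plain Python and build the dataframe once at the end. This is only a sketch: the set_cell helper and the cells dictionary are names made up for this example, not part of pandas.

```python
import pandas as pd

cells = {}  # {row_label: {column_label: value}}

def set_cell(row, col, value):
    # Hypothetical helper: record one value; unseen rows/columns are created on demand
    cells.setdefault(row, {})[col] = value

set_cell(0, 'col0', 'Hello')
set_cell(0, 'col1', 'World')
set_cell(2, 'col2', 'Say')

# Each outer key becomes a row; cells that were never set come out as NaN
df = pd.DataFrame.from_dict(cells, orient='index')
print(df)
```

Building the frame once avoids copying the whole dataframe on every added value, which is what repeated concatenation does.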
I've a Pandas DataFrame with 3 columns:
c={'a': [['US']],'b': [['US']], 'c': [['US','BE']]}
df = pd.DataFrame(c, columns = ['a','b','c'])
Now I need the max value of these 3 columns.
I've tried:
df['max_val'] = df[['a','b','c']].max(axis=1)
The result is NaN instead of the expected output: US.
How can I get the max value for these 3 columns? (And what if one of them contains NaN?)
Use:
c={'a': [['US', 'BE'],['US']],'b': [['US'],['US']], 'c': [['US','BE'],['US','BE']]}
df = pd.DataFrame(c, columns = ['a','b','c'])
from collections import Counter
df = df[['a','b','c']].apply(lambda x: list(Counter(map(tuple, x)).most_common()[0][0]), 1)
print(df)
0    [US, BE]
1        [US]
dtype: object
If, as @Erfan stated, you want the most common value in each row, use .agg() with mode:
df.agg('mode', axis=1)
          0
0  [US, BE]
1      [US]
Because your data are lists, you can't use mode() directly: list objects are unhashable, so mode() won't work on them.
One solution is to convert the elements of each dataframe row to strings and then use mode().
For example:
>>> import pandas as pd
>>> c = {'a': [['US','BE']],'b': [['US']], 'c': [['US','BE']]}
>>> df = pd.DataFrame(c, columns = ['a','b','c'])
>>> x = df.iloc[0].apply(lambda x: str(x))
>>> x.mode()
# Answer:
0    ['US', 'BE']
dtype: object
>>> d = {'a': [['US']],'b': [['US']], 'c': [['US','BE']]}
>>> df2 = pd.DataFrame(d, columns = ['a','b','c'])
>>> z = df2.iloc[0].apply(lambda z: str(z))
>>> z.mode()
# Answer:
0    ['US']
dtype: object
As far as I can see, some of your elements are lists, so I think the code below will work.
First, append all the values to a single list.
Then, find the most frequent element in that list.
from collections import Counter

arr = []
for i in df:                              # each column
    for j in range(len(df[i])):           # each row of that column
        for k in range(len(df[i][j])):    # each item of the list in that cell
            arr.append(df[i][j][k])

b = Counter(arr)
print(b.most_common())
This prints every value with its count, most frequent first.
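The question also asks what happens when one of the cells is NaN. A minimal sketch of one way to handle that, reusing the Counter idea from above and simply skipping anything that is not a list. The helper name most_common_list and the sample data are made up for this example.

```python
from collections import Counter

import numpy as np
import pandas as pd

def most_common_list(row):
    # Lists are unhashable, so count them as tuples; NaN cells are skipped
    cells = [tuple(v) for v in row if isinstance(v, list)]
    if not cells:
        return None
    return list(Counter(cells).most_common(1)[0][0])

df = pd.DataFrame({
    'a': [['US'], ['US', 'BE']],
    'b': [['US'], np.nan],          # missing value in the second row
    'c': [['US', 'BE'], ['US', 'BE']],
})

result = df.apply(most_common_list, axis=1)
print(result)
```

Ties are resolved in favour of whichever list was seen first in the row, which follows from how Counter.most_common orders equal counts.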
I'm new to Python and to working with APIs.
My code is below:
import pandas as pd
import json
from pandas.io.json import json_normalize
import datetime
threedaysago = datetime.date.fromordinal(datetime.date.today().toordinal()-3).strftime("%F")
import http.client
conn = http.client.HTTPSConnection("api.sendgrid.com")
payload = "{}"
keys = {
    # "CF" : "SG.UdhzjmjYR**.-",
}
df = [] # Create new Dataframe
for name, value in keys.items():
    headers = { 'authorization': "Bearer " + value }
    conn.request("GET", "/v3/categories/stats/sums?aggregated_by=&start_date={d}&end_date={d}".format(d=threedaysago), payload, headers)
    res = conn.getresponse()
    data = res.read()
    print(data.decode("utf-8"))
    d = json.loads(data.decode("utf-8"))
    c=d['stats']
    # row = d['stats'][0]['name']
    # Add Brand to data row here with 'name'
    df.append(c) # Load data row into df
#1
df = pd.DataFrame(df[0])
df_new = df[['name']]
df_new.rename(columns={'name':'Category'}, inplace=True)
df_metric =pd.DataFrame(list(df['metrics'].values))
sendgrid = pd.concat([df_new, df_metric], axis=1, sort=False)
sendgrid.set_index('Category', inplace = True)
sendgrid.insert(0, 'Date', threedaysago)
sendgrid.insert(1,'BrandId',99)
sendgrid.rename(columns={
    'blocks':'Blocks',
    'bounce_drops' : 'BounceDrops',
    'bounces': 'Bounces',
    'clicks':'Clicks',
    'deferred':'Deferred',
    'delivered':'Delivered',
    'invalid_emails': 'InvalidEmails',
    'opens':'Opens',
    'processed':'Processed',
    'requests':'Requests',
    'spam_report_drops' : 'SpamReportDrops',
    'spam_reports' : 'SpamReports',
    'unique_clicks' : 'UniqueClicks',
    'unique_opens' : 'UniqueOpens',
    'unsubscribe_drops' : 'UnsubscribeDrops',
    'unsubscribes': 'Unsubscribes'
    },
    inplace=True)
When I run this however, I receive an error:
KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]"
I know the reason this happens is because there are no stats available for three days ago:
{"date":"2020-02-16","stats":[]}
But how do I handle this case in my code? It is going to run as a daily report, and it will break if this error is not handled.
Sorry for the late answer.
KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]" means there is no column called name in your dataframe.
With "stats": [] the list passed to pd.DataFrame is empty, so the dataframe is created without any columns, which is why name is missing. Note that this is a different situation from a dict of columns in which one list is empty: that case raises ValueError: arrays must all be same length instead.
I have recreated both problems below to give you an idea of how to get around them.
Recreating KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]"
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c']}]
df = pd.DataFrame(df[0])
df = df[['D']]
print(df)
Output:
KeyError: "None of [Index(['D'], dtype='object')] are in the [columns]"
Solution: as you can see, there is no column called 'D' in the dataframe, so recheck your column names.
Now add 'D' and see what happens:
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c'], 'D': []}]
df = pd.DataFrame(df[0])
df = df[['D']]
print(df)
Output:
ValueError: arrays must all be same length
Solution: column 'D' needs to hold the same number of values as 'A', 'B', and 'C'.
Getting around both problems:
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c'], 'D':[]}]
df = pd.DataFrame.from_dict(df[0], orient='index')
df.transpose()  # returns a new dataframe; df itself is left unchanged
print(df)
Output:
      0     1     2
A     1     4     5
B     4     5     6
C     a     b     c
D  None  None  None
You can see that the columns are now represented as rows. You can use loc to select each of them by label.
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c'], 'D':[]}]
df = pd.DataFrame.from_dict(df[0], orient='index')
df.transpose()
df = df.loc[['A']]  # uses loc
print(df)
Output:
   0  1  2
A  1  4  5
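For the daily report itself, one way to keep it from breaking is to check for an empty stats list before selecting any columns. This is a minimal sketch, not the asker's actual pipeline: build_report is a hypothetical helper, and the "newsletter" payload is made-up sample data shaped like the name/metrics records the question's code expects.

```python
import pandas as pd

def build_report(stats):
    # Hypothetical helper: returns None when the API sent no stats for the day
    if not stats:
        return None
    df = pd.DataFrame(stats)
    metrics = pd.DataFrame(list(df['metrics'].values))
    names = df[['name']].rename(columns={'name': 'Category'})
    return pd.concat([names, metrics], axis=1).set_index('Category')

empty_day = {"date": "2020-02-16", "stats": []}
busy_day = {  # made-up sample payload
    "date": "2020-02-17",
    "stats": [{"name": "newsletter", "metrics": {"clicks": 3, "opens": 10}}],
}

print(build_report(empty_day["stats"]))
report = build_report(busy_day["stats"])
print(report)
```

The caller can then skip the rest of the processing, or write an empty report, whenever the helper returns None.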
I have this function
import pandas as pd
import numpy as np
from shapely.geometry import Point, Polygon
def test(e, n):
    polygon = Polygon([(340,6638),(340,6614),(375,6620),(374,6649)])
    point_instance = Point((e, n))
    a = polygon.contains(point_instance)
    val = np.where(a, 0, np.nan)
    return pd.Series([val])
I want to apply the function above to my dataframe and then remove the NaN rows:
def testData(filename):
    df = pd.read_csv(filename)
    df['check'] = df\
        .apply(lambda x: test(x['E'], x['N']), axis=1)
    # I tried both of these and neither deletes the NaN values
    df.dropna(axis=0, how = 'any', inplace = True)
    df1 = df.dropna(axis=0, how='any', subset=['check'])
However, if I save the data to a file and then use dropna, it works.
Sample dataframe
Id,E,N
1,5,8
2,6,9
3,7,10
This is the output I am getting
Id  E   N  check
 1  5   8    nan
 2  6   9    nan
 3  7  10    nan
It seems that using np.nan inside np.where creates conflicting data types, and for that reason pandas dropna didn't work.
I fixed it by using pandas map inside my function:
a = pd.Series(polygon.contains(point_instance))
val = a.map({True: 0, False: np.nan})
return val
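A related sketch, under the assumption that what matters is returning a plain float from the function so that the check column ends up with a numeric dtype. To keep the example runnable without shapely, the rectangle test below is a made-up stand-in for polygon.contains, and the sample rows are invented.

```python
import numpy as np
import pandas as pd

def check(e, n):
    # Stand-in for polygon.contains(Point(e, n)): a simple rectangle test (hypothetical)
    inside = (340 <= e <= 375) and (6614 <= n <= 6649)
    # Return a plain float scalar so the resulting column is float64
    return 0.0 if inside else np.nan

df = pd.DataFrame({'Id': [1, 2, 3], 'E': [5, 350, 7], 'N': [8, 6630, 10]})
df['check'] = df.apply(lambda row: check(row['E'], row['N']), axis=1)

kept = df.dropna(subset=['check'])
print(df['check'].dtype)
print(kept)
```

With a float column, dropna recognises the missing values and only the row that falls inside the rectangle is kept.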
I know that there are several ways to build up a dataframe in Pandas. My question is simply to understand why the method below doesn't work.
First, a working example. I can create an empty dataframe and then append a new one, similar to the documentation:
In [3]: df1 = pd.DataFrame([[1,2],], columns = ['a', 'b'])
   ...: df2 = pd.DataFrame()
   ...: df2.append(df1)
Out[3]:
   a  b
0  1  2
However, if I do the following, df2 ends up empty:
In [10]: df1 = pd.DataFrame([[1,2],], columns = ['a', 'b'])
    ...: df2 = pd.DataFrame()
    ...: for i in range(10):
    ...:     df2.append(df1)
In [11]: df2
Out[11]:
Empty DataFrame
Columns: []
Index: []
Can someone explain why it works this way? Thanks!
This happens because the .append() method returns a new dataframe rather than modifying df2 in place:
Pandas Docs (0.19.2):
pandas.DataFrame.append
Returns: appended: DataFrame
Here's a working example so you can see what's happening in each iteration of the loop:
df1 = pd.DataFrame([[1,2],], columns=['a','b'])
df2 = pd.DataFrame()
for i in range(0,2):
    print(df2.append(df1))
>    a  b
> 0  1  2
>    a  b
> 0  1  2
If you assign the output of .append() to a df (even the same one) you'll get what you probably expected:
for i in range(0,2):
    df2 = df2.append(df1)
print(df2)
>    a  b
> 0  1  2
> 0  1  2
I think what you are looking for is:
df1 = pd.DataFrame()
df2 = pd.DataFrame([[1,2,3],], columns=['a','b','c'])
for i in range(0,4):
    df1 = df1.append(df2)
df1
df.append() returns a new object. df2 is an empty dataframe initially, and it does not change. If you do df3 = df2.append(df1), you will get what you want.
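Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the snippets above only run on older versions. A minimal sketch of the same idea with pd.concat, which likewise returns a new dataframe that has to be assigned to a name (the variable name pieces is just for this example):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2]], columns=['a', 'b'])

# Collect the pieces in a list and concatenate once at the end
pieces = [df1 for _ in range(10)]
df2 = pd.concat(pieces, ignore_index=True)

print(df2.shape)
```

Concatenating once is also cheaper than growing the dataframe inside the loop, because each intermediate concatenation copies all rows collected so far.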
I want to change the column order of several dataframes using a for loop, but it doesn't work. My code is as follows:
import pandas as pd
df1 = pd.DataFrame({'a':1, 'b':2}, index=[1])
df2 = pd.DataFrame({'c':3, 'c':4}, index=[1])
for df in [df1, df2]:
    df = df.loc[:, df.columns.tolist()[::-1]]
After the loop, the column order of df1 and df2 is unchanged.
You can use multiple assignment with a list comprehension, i.e.:
df1,df2 = [i.loc[:,i.columns[::-1]] for i in [df1,df2]]
print(df1)
   b  a
1  2  1
print(df2)
   c
1  4
Note: In my answer I am trying to build up to showing that a dictionary is the best way to store the dataframes in the general case. If you are looking to change the original dataframe variables, @Bharath's answer is the way to go.
Answer:
The code doesn't work because you are not assigning back to the list of dataframes. Here's how to fix that:
import pandas as pd
df1 = pd.DataFrame({'a':1, 'b':2}, index=[1])
df2 = pd.DataFrame({'c':3, 'c':4}, index=[1])
l = [df1, df2]
for i, df in enumerate(l):
    l[i] = df.loc[:, df.columns.tolist()[::-1]]
The difference is that I iterate with enumerate to get each dataframe and its index in the list, then I assign the changed dataframe back to its original position in the list.
Execution details:
Before applying the change:
In [28]: for i in l:
    ...:     print(i.head())
    ...:
   a  b
1  1  2
   c
1  4
In [29]: for i, df in enumerate(l):
    ...:     l[i] = df.loc[:, df.columns.tolist()[::-1]]
    ...:
After applying the change:
In [30]: for i in l:
    ...:     print(i.head())
    ...:
   b  a
1  2  1
   c
1  4
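To see why plain reassignment inside the loop has no effect, here is a small sketch (the list name frames is chosen just for this example). Assigning to the loop variable only rebinds that name; it does not touch the list or the original variable.

```python
import pandas as pd

df1 = pd.DataFrame({'a': 1, 'b': 2}, index=[1])
frames = [df1]

for df in frames:
    # Rebinds the local name df to a new object; frames[0] and df1 are untouched
    df = df.loc[:, df.columns[::-1]]

unchanged = list(frames[0].columns)

for i, df in enumerate(frames):
    # Stores the new object back into the list slot
    frames[i] = df.loc[:, df.columns[::-1]]

print(unchanged)                  # column order after the first loop
print(list(frames[0].columns))    # column order after the second loop
print(list(df1.columns))          # the name df1 still refers to the original object
```

This is also why the list (or a dictionary) has to be used afterwards: the name df1 keeps pointing at the original dataframe unless it is reassigned explicitly.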
Improvement proposal:
It's better to use a dictionary as follows:
import pandas as pd
d= {}
d['df1'] = pd.DataFrame({'a':1, 'b':2}, index=[1])
d['df2'] = pd.DataFrame({'c':3, 'c':4}, index=[1])
for i,df in d.items():
    d[i] = df.loc[:, df.columns.tolist()[::-1]]
Then you will be able to reference your dataframes from the dictionary, for instance d['df1'].
You can reverse columns and values:
import pandas as pd
df1 = pd.DataFrame({'a':1, 'b': 2}, index=[1])
df2 = pd.DataFrame({'c':3, 'c': 4}, index=[1])
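print('before')
print(df1)
for df in [df1, df2]:
    # Writes through .values in place; this relies on all columns sharing one dtype
    # and may not work when pandas copy-on-write is enabled
    df.values[:,:] = df.values[:, ::-1]
    df.columns = df.columns[::-1]
print('after')
print(df1)
Output:
before
   a  b
1  1  2
after
   b  a
1  2  1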