What is the most efficient way to iterate over a dataframe, run a SQL query for each row, and save the result as a dataframe? - pandas

I have a dataframe like this:
import pandas as pd
import sqlalchemy

con = sqlalchemy.create_engine('....')

df = pd.DataFrame({
    'user_id': [1, 2, 3],
    'start_date': pd.Series(['2022-05-01 00:00:00', '2022-05-10 00:00:00', '2022-05-20 00:00:00'], dtype='datetime64[ns]'),
    'end_date': pd.Series(['2022-06-01 00:00:00', '2022-06-10 00:00:00', '2022-06-20 00:00:00'], dtype='datetime64[ns]'),
})
'''
user_id          start_date            end_date
      1 2022-05-01 00:00:00 2022-06-01 00:00:00
      2 2022-05-10 00:00:00 2022-06-10 00:00:00
      3 2022-05-20 00:00:00 2022-06-20 00:00:00
'''
I want to get the sales data for each user from the database for the date ranges specified in df. Below is the code I am currently using, and it works correctly.
df_stats = pd.DataFrame()
for k, j in df.iterrows():
    sql = '''
    select '{}' as user_id, sum(item_price) as sales, count(return) as return from sales
    where created_at between '{}' and '{}' and user_id={}'''.format(j['user_id'], j['start_date'], j['end_date'], j['user_id'])
    sql_to_df = pd.read_sql(sql, con)
    df_stats = df_stats.append(sql_to_df)
final = df.merge(df_stats, on='user_id')
'''
final:
user_id          start_date            end_date  sales  return
      1 2022-05-01 00:00:00 2022-06-01 00:00:00   1500       5
      2 2022-05-10 00:00:00 2022-06-10 00:00:00   2900       9
      3 2022-05-20 00:00:00 2022-06-20 00:00:00   1450       1
'''
But the articles I have read mention that iterrows() is very slow. Is there a way to make this process more efficient?
Note: this previously asked question is similar to mine, but I couldn't find a satisfactory answer there.

You can use .to_records to transform the rows into a list of tuples, then iterate over the list, unpack each tuple, and pass the arguments to your_sql_function:
import pandas as pd

data = {
    "user_id": [1, 2, 3],
    "start_date": pd.Series(["2022-05-01 00:00:00", "2022-05-10 00:00:00", "2022-05-20 00:00:00"], dtype="datetime64[ns]"),
    "end_date": pd.Series(["2022-06-01 00:00:00", "2022-06-10 00:00:00", "2022-06-20 00:00:00"], dtype="datetime64[ns]"),
}
df = pd.DataFrame(data)

for user, start, end in df.to_records(index=False):
    your_sql_function(user, start, end)
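If the queries really do have to be issued per row, two easy wins are to stop calling DataFrame.append inside the loop (it copies the whole frame on every iteration and is deprecated) and to collect the per-user results in a list, concatenating once at the end. A minimal sketch, assuming the same sales table, column names, and con engine as in the question, with the values bound through SQLAlchemy's text() instead of string formatting:

import pandas as pd
import sqlalchemy

query = sqlalchemy.text("""
    select :uid as user_id, sum(item_price) as sales, count(return) as return
    from sales
    where created_at between :start and :end and user_id = :uid
""")

frames = []
for row in df.itertuples(index=False):
    # one round trip per user, but no repeated copying of df_stats
    frames.append(pd.read_sql(query, con, params={
        "uid": int(row.user_id),
        "start": row.start_date.to_pydatetime(),
        "end": row.end_date.to_pydatetime(),
    }))

df_stats = pd.concat(frames, ignore_index=True)
final = df.merge(df_stats, on='user_id')

The per-row round trips usually dominate the runtime, though, so if the database allows it, pushing the whole job into a single query (for example by uploading df to a temporary table and joining on user_id plus the date range) will beat any client-side loop.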

Related

Count how many non-zero entries there are in each month in a dataframe column

I have a dataframe, df, with a DatetimeIndex and a single column.
I need to count how many non-zero entries I have in each month. For example, in January I would have 2 entries, in February 1 entry, and in March 2 entries. I have more months in the dataframe, but I guess that explains the problem.
I tried using pandas groupby:
df.groupby(df.index.month).count()
But that just gives me the total number of days in each month, and I didn't see any other parameter in count() that I could use here.
Any ideas?
Try index.to_period()
For example:
In [1]: import pandas as pd
import numpy as np
x_df = pd.DataFrame(
    {
        'values': np.random.randint(low=0, high=2, size=(120,))
    },
    index=pd.date_range("2022-01-01", periods=120, freq="D")
)
In [2]: x_df
Out[2]:
values
2022-01-01 0
2022-01-02 0
2022-01-03 1
2022-01-04 0
2022-01-05 0
...
2022-04-26 1
2022-04-27 0
2022-04-28 0
2022-04-29 1
2022-04-30 1
[120 rows x 1 columns]
In [3]: x_df[x_df['values'] != 0].groupby(lambda x: x.to_period("M")).count()
Out[3]:
values
2022-01 17
2022-02 15
2022-03 16
2022-04 17
Can you try this?
# turn zeros into NaN, drop them, then count the remaining rows per month
import numpy as np
dfx['col1'] = dfx['col1'].replace(0, np.nan)
dfx = dfx.dropna()
dfx = dfx.resample('1M').count()
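Both answers boil down to the same idea. A compact variant (a sketch, using the x_df example from the first answer) is to turn the column into a boolean mask and sum it per month, since each True counts as 1:

# count of non-zero entries per calendar month
monthly_nonzero = (x_df['values'] != 0).groupby(x_df.index.to_period('M')).sum()
print(monthly_nonzero)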

Summarize rows from a Pandas dataframe B that fall in certain time periods from another dataframe A

I am looking for an efficient way to summarize rows (in groupby-style) that fall in a certain time period, using Pandas in Python. Specifically:
The time period is given in dataframe A: there is a column for "start_timestamp" and a column for "end_timestamp", specifying the start and end time of the time period that is to be summarized. Hence, every row represents one time period that is meant to be summarized.
The rows to be summarized are given in dataframe B: there is a column for "timestamp" and a column "metric" with the values to be aggregated (with mean, max, min etc.). In reality, there might be more than just 1 "metric" column.
For every row's time period from dataframe A, I want to summarize the values of the "metric" column in dataframe B that fall in the given time period. Hence, the number of rows of the output dataframe will be exactly the same as the number of rows of dataframe A.
Any hints would be much appreciated.
Additional Requirements
The number of rows in dataframe A and dataframe B may be large (several thousand rows).
There may be many metrics to summarize in dataframe B (~100).
I want to avoid solving this problem with a for loop (as in the reproducible example below).
Reproducible Example
Input Dataframe A
# Input dataframe A
df_a = pd.DataFrame({
    "start_timestamp": ["2022-08-09 00:30", "2022-08-09 01:00", "2022-08-09 01:15"],
    "end_timestamp": ["2022-08-09 03:30", "2022-08-09 04:00", "2022-08-09 08:15"]
})
df_a.loc[:, "start_timestamp"] = pd.to_datetime(df_a["start_timestamp"])
df_a.loc[:, "end_timestamp"] = pd.to_datetime(df_a["end_timestamp"])
print(df_a)
      start_timestamp        end_timestamp
0 2022-08-09 00:30:00  2022-08-09 03:30:00
1 2022-08-09 01:00:00  2022-08-09 04:00:00
2 2022-08-09 01:15:00  2022-08-09 08:15:00
Input Dataframe B
# Input dataframe B
df_b = pd.DataFrame({
    "timestamp": [
        "2022-08-09 01:00",
        "2022-08-09 02:00",
        "2022-08-09 03:00",
        "2022-08-09 04:00",
        "2022-08-09 05:00",
        "2022-08-09 06:00",
        "2022-08-09 07:00",
        "2022-08-09 08:00",
    ],
    "metric": [1, 2, 3, 4, 5, 6, 7, 8],
})
df_b.loc[:, "timestamp"] = pd.to_datetime(df_b["timestamp"])
print(df_b)
            timestamp  metric
0 2022-08-09 01:00:00       1
1 2022-08-09 02:00:00       2
2 2022-08-09 03:00:00       3
3 2022-08-09 04:00:00       4
4 2022-08-09 05:00:00       5
5 2022-08-09 06:00:00       6
6 2022-08-09 07:00:00       7
7 2022-08-09 08:00:00       8
Expected Output Dataframe
# Expected output dataframe
df_target = df_a.copy()
for i, row in df_target.iterrows():
    condition = (df_b["timestamp"] >= row["start_timestamp"]) & (df_b["timestamp"] <= row["end_timestamp"])
    df_b_sub = df_b.loc[condition, :]
    df_target.loc[i, "metric_mean"] = df_b_sub["metric"].mean()
    df_target.loc[i, "metric_max"] = df_b_sub["metric"].max()
    df_target.loc[i, "metric_min"] = df_b_sub["metric"].min()
print(df_target)
      start_timestamp        end_timestamp  metric_mean  metric_max  metric_min
0 2022-08-09 00:30:00  2022-08-09 03:30:00          2.0         3.0         1.0
1 2022-08-09 01:00:00  2022-08-09 04:00:00          2.5         4.0         1.0
2 2022-08-09 01:15:00  2022-08-09 08:15:00          5.0         8.0         2.0
You can use pd.IntervalIndex and contains to create a dataframe with selected metric values and then compute the mean, max, min:
ai = pd.IntervalIndex.from_arrays(
    df_a["start_timestamp"], df_a["end_timestamp"], closed="both"
)
t = df_b.apply(
    lambda x: pd.Series((ai.contains(x["timestamp"])) * x["metric"]), axis=1
)
df_a[["metric_mean", "metric_max", "metric_min"]] = t[t.ne(0)].agg(
    ["mean", "max", "min"]
).T.values
print(df_a) gives:
start_timestamp end_timestamp metric_mean metric_max metric_min
0 2022-08-09 00:30:00 2022-08-09 03:30:00 2.0 3.0 1.0
1 2022-08-09 01:00:00 2022-08-09 04:00:00 2.5 4.0 1.0
2 2022-08-09 01:15:00 2022-08-09 08:15:00 5.0 8.0 2.0
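Another loop-free option, since the periods here overlap, is a plain NumPy broadcast: build a boolean matrix with one row per period in df_a and one column per row of df_b, then aggregate along the columns. A minimal sketch under the same example data (it materialises the full len(df_a) x len(df_b) matrix, so it assumes both frames stay in the few-thousand-row range mentioned in the question):

import numpy as np

ts = pd.to_datetime(df_b["timestamp"]).values                     # shape (n_b,)
start = pd.to_datetime(df_a["start_timestamp"]).values[:, None]   # shape (n_a, 1)
end = pd.to_datetime(df_a["end_timestamp"]).values[:, None]       # shape (n_a, 1)

# True where a df_b timestamp falls inside a df_a period
inside = (ts >= start) & (ts <= end)
metric = np.where(inside, df_b["metric"].to_numpy(dtype=float), np.nan)

# a period with no matching rows yields NaN (and a RuntimeWarning) here
df_a["metric_mean"] = np.nanmean(metric, axis=1)
df_a["metric_max"] = np.nanmax(metric, axis=1)
df_a["metric_min"] = np.nanmin(metric, axis=1)

With many metric columns, the same inside mask can be reused for each column, so the cost of building it is paid only once.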
Check the below code using SQLite3:
import sqlite3
conn = sqlite3.connect(':memory:')
df_a.to_sql('df_a',con=conn, index=False)
df_b.to_sql('df_b',con=conn, index=False)
pd.read_sql("""SELECT df_a.start_timestamp, df_a.end_timestamp
, AVG(df_b.metric) as metric_mean
, MAX(df_b.metric) as metric_max
, MIN(df_b.metric) as metric_min
FROM
df_a INNER JOIN df_b
ON df_b.timestamp BETWEEN df_a.start_timestamp AND df_a.end_timestamp
GROUP BY df_a.start_timestamp, df_a.end_timestamp""", con=conn)
Output: the same three summary rows as the expected output dataframe above.

Create a row for each year between two dates

I have a dataframe with two date columns (format: YYYY-MM-DD). I want to create one row for each year between those two dates. The rows would be identical, with a new column that specifies the year. For example, if the dates are 2018-01-01 and 2020-01-01, there would be three rows with the same data and a new column with values 2018, 2019, and 2020.
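For reference, a minimal frame matching that example (the column names date1 and date2 are assumptions carried through the answers below):

import pandas as pd

df = pd.DataFrame({
    'date1': ['2018-01-01'],
    'date2': ['2020-01-01'],
})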
You can use a custom function to compute the range then explode the column:
# Ensure to have datetime
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
# Create the new column
date_range = lambda x: range(x['date1'].year, x['date2'].year+1)
df = df.assign(year=df.apply(date_range, axis=1)).explode('year', ignore_index=True)
Output:
>>> df
date1 date2 year
0 2018-01-01 2020-01-01 2018
1 2018-01-01 2020-01-01 2019
2 2018-01-01 2020-01-01 2020
This should work for you:
import pandas

# some sample data
df = pandas.DataFrame(data={
    'foo': ['bar', 'baz'],
    'date1': ['2018-01-01', '2022-01-01'],
    'date2': ['2020-01-01', '2017-01-01']
})

# cast date columns to datetime
for col in ['date1', 'date2']:
    df[col] = pandas.to_datetime(df[col])

# reset index to ensure that selection by length of index works
df = df.reset_index(drop=True)

# compute the range of years between the two dates, and iterate through the resulting
# series to unpack the range of years and add a new row with the original data and the year
for i, years in df.apply(
    lambda x: range(
        min(x.date1, x.date2).year,
        max(x.date1, x.date2).year + 1
    ),
    axis='columns'
).iteritems():
    for year in years:
        new_index = len(df.index)
        df.loc[new_index] = df.loc[i].values
        df.loc[new_index, 'year'] = int(year)
output:
>>> df
foo date1 date2 year
0 bar 2018-01-01 2020-01-01 NaN
1 baz 2022-01-01 2017-01-01 NaN
2 bar 2018-01-01 2020-01-01 2018.0
3 bar 2018-01-01 2020-01-01 2019.0
4 bar 2018-01-01 2020-01-01 2020.0
5 baz 2022-01-01 2017-01-01 2017.0
6 baz 2022-01-01 2017-01-01 2018.0
7 baz 2022-01-01 2017-01-01 2019.0
8 baz 2022-01-01 2017-01-01 2020.0
9 baz 2022-01-01 2017-01-01 2021.0
10 baz 2022-01-01 2017-01-01 2022.0
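A middle ground between the two answers (a sketch, assuming the original two-row sample frame with foo, date1 and date2 before the loop above has modified it): build the year lists with a list comprehension, which also tolerates reversed dates, and explode once.

df['year'] = [
    list(range(min(a, b).year, max(a, b).year + 1))
    for a, b in zip(df['date1'], df['date2'])
]
df = df.explode('year', ignore_index=True)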

Vectorize a function for a GroupBy Pandas Dataframe

I have a Pandas dataframe sorted by a datetime column. Several rows will have the same datetime, but the "report type" column value is different. I need to select just one of those rows based on a list of preferred report types. The list is in order of preference. So, if one of those rows has the first element in the list, then that is the row chosen to be appended to a new dataframe.
I've tried a GroupBy with (very slow) Python for loops to process each group, find the preferred report type, and append that row to a new dataframe. I thought about numpy's vectorize(), but I don't know how to incorporate the group by into it. I really don't know much about dataframes but am learning. Any ideas on how to make it faster? Can I incorporate the group by?
The example dataframe
OBSERVATIONTIME REPTYPE CIGFT
2000-01-01 00:00:00 AUTO 73300
2000-01-01 00:00:00 FM-15 25000
2000-01-01 00:00:00 FM-12 3000
2000-01-01 01:00:00 SAO 9000
2000-01-01 01:00:00 FM-16 600
2000-01-01 01:00:00 FM-15 5000
2000-01-01 01:00:00 AUTO 5000
2000-01-01 02:00:00 FM-12 12000
2000-01-01 02:00:00 FM-15 15000
2000-01-01 02:00:00 FM-16 8000
2000-01-01 03:00:00 SAO 700
2000-01-01 04:00:00 SAO 3000
2000-01-01 05:00:00 FM-16 5000
2000-01-01 06:00:00 AUTO 15000
2000-01-01 06:00:00 FM-12 12500
2000-01-01 06:00:00 FM-16 12000
2000-01-01 07:00:00 FM-15 20000
#################################################
# The function to loop through and find the row
################################################
def select_the_one_ob(df):
    ''' select the preferred observation '''
    tophour_df = pd.DataFrame()
    preferred_order = ['FM-15', 'AUTO', 'SAO', 'FM-16', 'SAOSP', 'FM-12',
                       'SY-MT', 'SY-SA']
    grouped = df.groupby("OBSERVATIONTIME", as_index=False)
    for name, group in grouped:
        a_group_df = pd.DataFrame(grouped.get_group(name))
        for reptype in preferred_order:
            preferred_found = False
            for i in a_group_df.index.values:
                if a_group_df.loc[i, 'REPTYPE'] == reptype:
                    tophour_df = tophour_df.append(a_group_df.loc[i].transpose())
                    preferred_found = True
                    break
            if preferred_found:
                break
        del a_group_df
    return tophour_df
################################################
### The function which calls the above function
################################################
def process_ceiling(plat, network):
    platformcig.data_pull(CONNECT_SRC, PULL_CEILING)
    data_df = platformcig.df
    data_df = select_the_one_ob(data_df)
With the complete dataset of 300,000 rows, the function takes over 4 hours.
I need it to be much faster. Can I incorporate the group by in numpy vectorize()?
You can avoid using groupby. One way could be to make your 'REPTYPE' column an ordered pd.Categorical and then use sort_values and drop_duplicates, such as:
def select_the_one_ob(df):
    preferred_order = ['FM-15', 'AUTO', 'SAO', 'FM-16', 'SAOSP', 'FM-12', 'SY-MT', 'SY-SA']
    df.REPTYPE = pd.Categorical(df.REPTYPE, categories=preferred_order, ordered=True)
    return (df.sort_values(by=['OBSERVATIONTIME', 'REPTYPE'])
              .drop_duplicates(subset='OBSERVATIONTIME', keep='first'))
and you get with your example:
OBSERVATIONTIME REPTYPE CIGFT
1 2000-01-01 00:00:00 FM-15 25000
5 2000-01-01 01:00:00 FM-15 5000
8 2000-01-01 02:00:00 FM-15 15000
10 2000-01-01 03:00:00 SAO 700
11 2000-01-01 04:00:00 SAO 3000
12 2000-01-01 05:00:00 FM-16 5000
13 2000-01-01 06:00:00 AUTO 15000
16 2000-01-01 07:00:00 FM-15 20000
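Since the question explicitly asks about keeping the group by, an equivalent sketch (assuming the example table above is loaded as df, with the same column names) maps each report type to its rank in the preference list and takes the best row per timestamp with groupby(...).idxmin():

preferred_order = ['FM-15', 'AUTO', 'SAO', 'FM-16', 'SAOSP', 'FM-12', 'SY-MT', 'SY-SA']
rank = {rep: i for i, rep in enumerate(preferred_order)}

# lower rank = more preferred; report types not in the list sort last
pref = df['REPTYPE'].map(rank).fillna(len(preferred_order))
tophour_df = df.loc[pref.groupby(df['OBSERVATIONTIME']).idxmin()]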
I found that by creating a separate dataframe of the same shape, populated with each hour of the observation time, I could use pandas merge() on the first pass and combine_first() on the later passes. This took only minutes instead of hours.
def select_the_one_ob(df):
    ''' select the preferred observation

    Parameters:
        df (Pandas Object), a Pandas dataframe

    Returns Pandas Dataframe
    '''
    dshelldict = {'DateTime': pd.date_range(BEG_POR, END_POR, freq='H')}
    dshell = pd.DataFrame(data=dshelldict)
    dshell['YEAR'] = dshell['DateTime'].dt.year
    dshell['MONTH'] = dshell['DateTime'].dt.month
    dshell['DAY'] = dshell['DateTime'].dt.day
    dshell['HOUR'] = dshell['DateTime'].dt.hour
    dshell = dshell.set_index(['YEAR', 'MONTH', 'DAY', 'HOUR'])
    df = df.set_index(['YEAR', 'MONTH', 'DAY', 'HOUR'])

    #tophour_df = pd.DataFrame()
    preferred_order = ['FM-15', 'AUTO', 'SAO', 'FM-16', 'SAOSP', 'FM-12', 'SY-MT', 'SY-SA']
    reptype_list = list(df.REPTYPE.unique())

    # remove the preferred report types from the unique ones
    for rep in preferred_order:
        if rep in reptype_list:
            reptype_list.remove(rep)

    # If there are any unique report types left, append them to the preferred list
    if len(reptype_list) > 0:
        preferred_order = preferred_order + reptype_list

    ## first_pass is a flag to make sure a report type has been used to transfer columns to the new DataFrame
    ## (merge has to happen before combine_first)
    first_pass = True
    for reptype in preferred_order:
        if first_pass:
            ## if there is data in dataframe
            if df[(df['MINUTE'] == 00) & (df['REPTYPE'] == reptype)].shape[0] > 0:
                first_pass = False
                # Merge shell with first df with data; the dataframe is sorted by original
                # obstime and any dup's are dropped keeping first, aka the first report chronologically
                tophour_df = dshell.merge(
                    df[(df['MINUTE'] == 00) & (df['REPTYPE'] == reptype)]
                    .sort_values(['OBSERVATIONTIME'], ascending=True)
                    .drop_duplicates(subset=['ROLLED_OBSERVATIONTIME'], keep='first'),
                    how='left', left_index=True, right_index=True
                ).drop('DateTime', axis=1)
        else:
            # combine_first takes the original dataframe and fills any nan values with data
            # of another identically shaped dataframe
            # ex. if value df.loc[2,col1] is nan, df2.loc[2,col1] would fill it if not nan
            tophour_df = tophour_df.combine_first(
                df[(df['MINUTE'] == 00) & (df['REPTYPE'] == reptype)]
                .sort_values(['OBSERVATIONTIME'], ascending=True)
                .drop_duplicates(subset=['ROLLED_OBSERVATIONTIME'], keep='first')
            )
    tophour_df = tophour_df.reset_index()
    return tophour_df

How to change datetime to numeric discarding 0s at end [duplicate]

I have a pandas dataframe called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_datetime. I am trying to figure out how to calculate people's ages based on the time difference between 'entry_date' and 'dob'. To do this I need to get the difference in days between the two columns (so that I can then do something like round(days / 365.25)). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date - munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However, I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the pandas Timedelta type, available since v0.15.0, you can also do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
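Applied to the frame in the question (a sketch, assuming munged_data really has datetime columns entry_date and dob with no missing values), the age calculation the asker describes becomes:

# whole days between the two datetime columns, then rough years
days = (munged_data['entry_date'] - munged_data['dob']).dt.days
munged_data['age'] = (days / 365.25).round().astype(int)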
You need pandas 0.11 for this (0.11rc1 is out; the final release is probably coming next week):
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (i.e. like how we use Timestamps now for datetime64[ns]; that is coming in 0.12).
Not sure if you still need it, but in pandas 0.14 I usually use the .astype('timedelta64[X]') method:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
Suppose you have a pandas Series named time_difference whose dtype is
numpy.timedelta64[ns]
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.tslib.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
To convert any type of data into days, just use pd.Timedelta(...).days:
pd.Timedelta(1985, unit='Y').days
84494