How to subtract sales for month 1 and month 2 for every customer in my dataframe using pandas?

This is my data frame:

import pandas as pd

c = pd.DataFrame({"Product": ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"],
                  "sales": [10000, 20000, 30000, 40000, 10000, 24000, 13000, 20000],
                  "Month": ["M1", "M2", "M1", "M2", "M1", "M2", "M1", "M2"]})
The answer should be another dataframe. I tried using boolean masking, but I am not sure how to work with both of the columns.

Is this what you are looking for?

import pandas as pd
import numpy as np

c = pd.DataFrame({"Product": ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"],
                  "sales": [10000, 20000, 30000, 40000, 10000, 24000, 13000, 20000],
                  "Month": ["M1", "M2", "M1", "M2", "M1", "M2", "M1", "M2"]})

# Negate the M2 sales so that summing per product yields M1 - M2
c['sales'] = np.where(c['Month'] == "M2", c['sales'] * -1, c['sales'])
c.groupby('Product')['sales'].sum()

Note that this will work only when the Month column contains just 'M1' and 'M2'.
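An alternative sketch, assuming every product has exactly one 'M1' and one 'M2' row: pivot the months into columns and subtract them directly, which leaves the original sales column untouched.

import pandas as pd

c = pd.DataFrame({"Product": ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"],
                  "sales": [10000, 20000, 30000, 40000, 10000, 24000, 13000, 20000],
                  "Month": ["M1", "M2", "M1", "M2", "M1", "M2", "M1", "M2"]})

# One row per product, one column per month
wide = c.pivot(index="Product", columns="Month", values="sales")
wide["diff"] = wide["M1"] - wide["M2"]
# diff: p1 -> -10000, p2 -> -10000, p3 -> -14000, p4 -> -7000
print(wide)

This also generalizes to more months, since you can subtract whichever month columns you need.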

Related

Converting SQL query into pandas syntax

I am very new to pandas. How do I convert the following query into pandas syntax? I am no longer querying an MS Access table; I am now querying a pandas DataFrame called df.
The query is:
SELECT
    Short_ID,
    SUM(IIF(Status = 'Completed', 1, 0)) / COUNT(Status) AS completion_metric
FROM
    PROMIS_LT_Long_ID
GROUP BY
    Short_ID;
The query results would be something like this:
Short_ID | completion_metric
---------+------------------
1004 | 0.125
1005 | 0
1004 | 0.5
I have created the pandas df with the following code and now I would like to query the pandas DataFrame and obtain the same result as the query above.
import pyodbc
import pandas as pd

def connect_to_db():
    db_name = "imuscigrp"
    conn = pyodbc.connect(r'DRIVER={SQL Server};SERVER=tcp:SQLDCB301P.uhn.ca\SQLDCB301P;DATABASE=imucsigrp'
                          r';UID=imucsigrp_data_team;PWD=Kidney123!')
    cursor = conn.cursor()
    return cursor, conn

def completion_metric():
    cursor, conn = connect_to_db()
    # read_sql_query (not read-sql_query) returns a DataFrame directly
    SQL_Query = pd.read_sql_query('SELECT PROMIS_LT_Long_ID.Short_ID, PROMIS_LT_Long_ID.Status FROM PROMIS_LT_Long_ID', conn)
    # converts SQL_Query into a pandas dataframe
    df = pd.DataFrame(SQL_Query, columns=["Short_ID", "Status"])
    # querying the df to obtain longitudinal completion metric values
    return
Any contributions will help, thank you.
You can use some numpy functions to perform similar operations.
For example, numpy.where to replace the value based on a condition:

import numpy as np

df = pd.DataFrame(SQL_Query, columns=["Short_ID", "Status"])
df["completion_metric"] = np.where(df.Status == "Completed", 1, 0)

Then numpy.average to compute an average value for the grouped data:

completion_metric = df.groupby("Short_ID").agg({"completion_metric": np.average})
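For completeness, here is a self-contained sketch with a small hypothetical frame standing in for the SQL result, showing that the group mean reproduces SUM(IIF(...)) / COUNT(Status):

import numpy as np
import pandas as pd

# Hypothetical stand-in for the query result
df = pd.DataFrame({"Short_ID": [1004, 1004, 1005],
                   "Status": ["Completed", "Started", "Started"]})

# 1 for completed rows, 0 otherwise; the per-group mean is then
# completed / total, i.e. the completion metric
df["completion_metric"] = np.where(df["Status"] == "Completed", 1, 0)
result = df.groupby("Short_ID", as_index=False)["completion_metric"].mean()
print(result)
#    Short_ID  completion_metric
# 0      1004                0.5
# 1      1005                0.0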

Dask .loc only the first result (iloc[0])

Sample dask dataframe:
import pandas as pd
import dask
import dask.dataframe as dd
df = pd.DataFrame({'col_1': [1,2,3,4,5,6,7], 'col_2': list('abcdefg')},
index=pd.Index([0,0,1,2,3,4,5]))
df = dd.from_pandas(df, npartitions=2)
Now I would like to get only the first result back (based on the index), like this in pandas:

df.loc[df.col_1 > 3].iloc[0]

   col_1 col_2
2      4     d

I know there is no positional row indexing in dask using iloc, but I wonder if it would be possible to limit the query to 1 result, like in SQL?
Got it, though I am not sure about the efficiency here:

tmp = df.loc[df.col_1 > 3]
tmp.loc[tmp.index == tmp.index.min().compute()].compute()
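A possible alternative, sketched against the sample frame above (not benchmarked): dask's head() behaves like SQL's LIMIT, and npartitions=-1 keeps it scanning past the first partition when the filter leaves that partition empty. This assumes index order is the order you care about, which holds for dd.from_pandas:

# head(1) works like LIMIT 1; npartitions=-1 searches every partition
# instead of giving up after the first one
df.loc[df.col_1 > 3].head(1, npartitions=-1)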

Multiprocessing the Fuzzy match in pandas

I have two data frames:
Df_Address, which has 347k distinct addresses, and Df_Project, which has 24k records with Project_Id, Project_Start_Date, and Project_Address.
I want to check whether there is a fuzzy match for my Project_Address in Df_Address. If there is a match, I want to extract the Project_Id and Project_Start_Date for it. Below is the code of what I am trying:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Df_Address = pd.read_csv("Cantractor_Addresses.csv")
Df_Project = pd.read_csv("Project_info.csv")
#address = list(Df_Project["Project_Address"])

def fuzzy_match(x, choices, cutoff):
    print(x)
    return process.extractOne(x, choices=choices, score_cutoff=cutoff)

Matched = Df_Address["Address"].apply(
    fuzzy_match,
    args=(
        Df_Project["Project_Address"],
        80
    )
)
This code does produce output in the form of a tuple:
('matched_string', score)
But it is also returning strings that are merely similar, and I still need to extract Project_Id and Project_Start_Date. Can someone help me achieve this using parallel processing, since the data is huge?
You can convert the tuples into a dataframe and then join it back to your base data frame.

import pandas as pd

Df_Address = pd.DataFrame({'address': ['abc', 'cdf'], 'random_stuff': [100, 200]})
Matched = (('abc', 10), ('cdf', 20))

dist = pd.DataFrame(Matched, columns=['address', 'distance'])
final = Df_Address.merge(dist, how='left', on='address')
print(final)

Output:

  address  random_stuff  distance
0     abc           100        10
1     cdf           200        20
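On the parallel-processing part of the question, here is a minimal sketch with multiprocessing.Pool. It assumes the file and column names from the question; the worker count is illustrative, and on Windows the calls must live under the __main__ guard as shown:

from functools import partial
from multiprocessing import Pool

import pandas as pd
from fuzzywuzzy import process

def fuzzy_match(x, choices, cutoff):
    # Best match for one address, or None if nothing clears the cutoff
    return process.extractOne(x, choices, score_cutoff=cutoff)

def match_all(addresses, choices, cutoff=80, workers=4):
    # Each worker scores one address against the full list of choices
    with Pool(workers) as pool:
        return pool.map(partial(fuzzy_match, choices=choices, cutoff=cutoff),
                        addresses)

if __name__ == "__main__":
    Df_Address = pd.read_csv("Cantractor_Addresses.csv")
    Df_Project = pd.read_csv("Project_info.csv")
    matches = match_all(Df_Address["Address"].tolist(),
                        Df_Project["Project_Address"].tolist())

From there, the matched strings can be put into a dataframe and merged back as above to pull in Project_Id and Project_Start_Date.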

dataframe python converting to weekday from year, month, day

I am trying to add a new DataFrame column by manipulating other columns.
import pandas as pd
import numpy as np
import datetime

df = pd.read_csv('PRSA_data_2010.1.1-2014.12.31.csv')
df.head()
When I try to compute it with

df['weekday'] = np.int(datetime.datetime(df.year, df.month, df.day).weekday())

I keep getting the error "cannot convert the series to <class 'int'>".
Can anyone tell me the reason behind this and how I can fix it?
Thanks in advance!
datetime.datetime expects scalar integers, but df.year, df.month and df.day are whole Series, which is why the conversion to int fails. Convert the columns to datetimes and then to weekdays with Series.dt.weekday:

df['weekday'] = pd.to_datetime(df[['year', 'month', 'day']]).dt.weekday

Or convert the columns to a single datetime column in read_csv:

df = pd.read_csv('PRSA_data_2010.1.1-2014.12.31.csv',
                 date_parser=lambda y, m, d: y + '-' + m + '-' + d,
                 parse_dates={'datetimes': ['year', 'month', 'day']})
df['weekday'] = df['datetimes'].dt.weekday
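To see the first approach work without the CSV, here is a tiny stand-in frame (column names assumed from the question):

import pandas as pd

df = pd.DataFrame({'year': [2010, 2014], 'month': [1, 12], 'day': [1, 31]})

# to_datetime accepts a frame whose columns are named year/month/day
df['weekday'] = pd.to_datetime(df[['year', 'month', 'day']]).dt.weekday
print(df)
#    year  month  day  weekday
# 0  2010      1    1        4
# 1  2014     12   31        2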

Pandas: conditional information retrieval on a date range

I'm still fairly new to pandas, and the script I wrote to accomplish a seemingly easy task seems needlessly complicated. If you know of an easier way to accomplish this, I would be extremely grateful.
Task:
I have two spreadsheets (df1 & df2), each with an identifier (mrn) and a date. My task is to retrieve a value from df2 for each row in df1 if the following conditions are met:
the identifier for a given row in df1 exists in df2
if the above is true, retrieve the value in df2 if the associated date is within a +/-5 day range of the date in df1
I have written the following code which accomplishes this:
#%% housekeeping
import pandas as pd
from datetime import timedelta
from io import StringIO

#%% dataframe import
df1 = ',mrn,date,foo\n0,1,2015-03-06,n/a\n1,11,2009-08-14,n/a\n2,14,2009-05-18,n/a\n3,20,2010-06-19,n/a\n'
df2 = ',mrn,collection Date,Report\n0,1,2015-03-06,report to import1\n1,11,2009-08-12,report to import11\n2,14,2009-05-21,report to import14\n3,20,2010-06-25,report to import20\n'
df1 = pd.read_csv(StringIO(df1))
df2 = pd.read_csv(StringIO(df2))

# converting to date-time format
df1['date'] = pd.to_datetime(df1['date'])
df2['collection Date'] = pd.to_datetime(df2['collection Date'])

#%% mask()
def mask(dates, rangeTime):
    # True where a date falls within the +/-5 day window around rangeTime
    return (dates > rangeTime - timedelta(days=5)) & (dates <= rangeTime + timedelta(days=5))

#%% detailLoop()
for i, element in enumerate(df1["mrn"]):
    df1DateIter = df1.loc[i, 'date']
    df2MRNmatch = df2.loc[df2['mrn'] == element, ['collection Date', 'Report']]
    df2Date = df2MRNmatch['collection Date']
    df2Report = df2MRNmatch['Report']
    maskOut = mask(df2Date, df1DateIter)
    if maskOut.iloc[0]:
        df1.loc[i, 'foo'] = df2Report.iloc[0]
Once the script has been run, df1 looks like:

Out[824]:
   mrn       date                 foo
0    1 2015-03-06   report to import1
1   11 2009-08-14  report to import11
2   14 2009-05-18  report to import14
3   20 2010-06-19                 NaN
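If you are open to a shorter route, pd.merge_asof can do the per-mrn nearest-date match in one call. A sketch against df1 and df2 as loaded above; note that its tolerance is a symmetric +/-5 days, slightly different from the half-open window in mask():

# merge_asof needs both frames sorted by the date keys;
# by='mrn' restricts matches to rows with the same identifier
matched = pd.merge_asof(df1.sort_values('date'),
                        df2.sort_values('collection Date'),
                        left_on='date', right_on='collection Date',
                        by='mrn', direction='nearest',
                        tolerance=pd.Timedelta(days=5))
print(matched[['mrn', 'date', 'Report']])

Rows with no df2 date inside the window (mrn 20 here) come back with Report as NaN, matching the loop's result.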