Converting a SQL query into pandas syntax

I am very new to pandas. How do I convert the following query into pandas syntax? I am no longer querying an MS Access table; I am now querying a pandas DataFrame called df.
The query is:
SELECT
    Short_ID,
    SUM(IIF(Status = 'Completed', 1, 0)) / COUNT(Status) AS completion_metric
FROM
    PROMIS_LT_Long_ID
GROUP BY
    Short_ID;
The query results would be something like this:
Short_ID | completion_metric
---------+------------------
1004 | 0.125
1005 | 0
1004 | 0.5
I have created the pandas df with the following code, and now I would like to query the DataFrame to obtain the same result as the query above.
import pyodbc
import pandas as pd

def connect_to_db():
    db_name = "imuscigrp"
    conn = pyodbc.connect(r'DRIVER={SQL Server};SERVER=tcp:SQLDCB301P.uhn.ca\SQLDCB301P;DATABASE=imucsigrp'
                          r';UID=imucsigrp_data_team;PWD=Kidney123!')
    cursor = conn.cursor()
    return cursor, conn

def completion_metric():
    cursor, conn = connect_to_db()
    SQL_Query = pd.read_sql_query('SELECT PROMIS_LT_Long_ID.Short_ID, PROMIS_LT_Long_ID.Status FROM PROMIS_LT_Long_ID', conn)
    # converts SQL_Query into a pandas DataFrame
    df = pd.DataFrame(SQL_Query, columns=["Short_ID", "Status"])
    # querying the df to obtain longitudinal completion metric values
    return
Any contributions will help, thank you

You can use some numpy functions for performing similar operations.
For example, numpy.where to replace the value based on a condition.
import numpy as np
df = pd.DataFrame(SQL_Query, columns = ["Short_ID", "Status"])
df["completion_metric"] = np.where(df.Status == "Completed", 1, 0)
Then numpy.average to compute an average value for the grouped data.
completion_metric = df.groupby("Short_ID").agg({"completion_metric": np.average})
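Putting the pieces together, here is a minimal end-to-end sketch with made-up sample data (the Short_ID/Status values are illustrative, not from the real table). Because the flag is 0/1, the group mean equals the SUM(...)/COUNT(...) ratio in the SQL:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the data returned by the SQL query
df = pd.DataFrame({
    "Short_ID": [1004, 1004, 1005, 1005],
    "Status":   ["Completed", "Not Started", "Not Started", "Not Started"],
})

# 1 where Status is 'Completed', else 0 -- the IIF(...) in the SQL
df["completion_metric"] = np.where(df["Status"] == "Completed", 1, 0)

# Mean of the 0/1 flags per group == SUM(flag) / COUNT(Status)
completion_metric = df.groupby("Short_ID", as_index=False)["completion_metric"].mean()
print(completion_metric)
```

Using the string aggregation name `"mean"` (or `.mean()` as above) avoids the deprecation warning newer pandas versions emit when you pass `np.average` to `agg`.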

Related

How to subtract sales for month 1 and month 2 for every customer in my dataframe using pandas?

This is my data frame:
c = pd.DataFrame({"Product": ["p1","p1","p2","p2","p3","p3","p4","p4"],
                  "sales":   [10000,20000,30000,40000,10000,24000,13000,20000],
                  "Month":   ["M1","M2","M1","M2","M1","M2","M1","M2"]})
The answer should be another dataframe.
I tried using boolean masking, but I am not sure how to work with both columns.
Is this what you are looking for?
import pandas as pd
import numpy as np
c = pd.DataFrame({"Product": ["p1","p1","p2","p2","p3","p3","p4","p4"],
                  "sales":   [10000,20000,30000,40000,10000,24000,13000,20000],
                  "Month":   ["M1","M2","M1","M2","M1","M2","M1","M2"]})
c['sales'] = np.where(c['Month'] == "M2", c['sales'] * -1, c['sales'])
c.groupby('Product')['sales'].sum()
This will only work in the case where you have just 'M1' and 'M2'.
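An alternative sketch that avoids negating the sales column is to pivot the months into columns and subtract them explicitly (this assumes each Product has exactly one row per Month):

```python
import pandas as pd

c = pd.DataFrame({"Product": ["p1","p1","p2","p2","p3","p3","p4","p4"],
                  "sales":   [10000,20000,30000,40000,10000,24000,13000,20000],
                  "Month":   ["M1","M2","M1","M2","M1","M2","M1","M2"]})

# One column per month, one row per product
wide = c.pivot(index="Product", columns="Month", values="sales")

# Month 1 minus month 2 for every product
diff = (wide["M1"] - wide["M2"]).rename("sales_diff")
print(diff)
```

This keeps the original frame untouched and extends naturally if more months appear later.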

Dask .loc only the first result (iloc[0])

Sample dask dataframe:
import pandas as pd
import dask
import dask.dataframe as dd
df = pd.DataFrame({'col_1': [1,2,3,4,5,6,7], 'col_2': list('abcdefg')},
index=pd.Index([0,0,1,2,3,4,5]))
df = dd.from_pandas(df, npartitions=2)
Now I would like to get only the first (based on the index) result back, like this in pandas:
df.loc[df.col_1 >3].iloc[0]
col_1 col_2
2 4 d
I know there is no positional row indexing in Dask using iloc, but I wonder whether it is possible to limit the query to one result, like in SQL?
Got it, but I'm not sure about the efficiency here:
tmp = df.loc[df.col_1 > 3]
tmp.loc[tmp.index == tmp.index.min().compute()].compute()

Multiprocessing the Fuzzy match in pandas

I have two data frames.
DF_Address, which has 347k distinct addresses, and DF_Project, which has 24k records with Project_Id, Project_Start_Date and Project_Address.
I want to check whether there is a fuzzy match of my Project_Address in Df_Address. If there is a match, I want to extract the Project_ID and Project_Start_Date for it. Below is the code of what I am trying:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Df_Address = pd.read_csv("Cantractor_Addresses.csv")
Df_Project = pd.read_csv("Project_info.csv")
#address = list(Df_Project["Project_Address"])

def fuzzy_match(x, choices, cutoff):
    print(x)
    return process.extractOne(
        x, choices=choices, score_cutoff=cutoff
    )

Matched = Df_Address["Address"].apply(
    fuzzy_match,
    args=(
        Df_Project["Project_Address"],
        80
    )
)
This code does provide output in the form of a tuple ('matched_string', score), but it also returns strings that are merely similar. I also need to extract Project_Id and Project_Start_Date. Can someone help me achieve this using parallel processing, as the data is huge?
You can convert the tuples into a dataframe and then join it back to your base data frame.
import pandas as pd

Df_Address = pd.DataFrame({'address': ['abc','cdf'], 'random_stuff': [100,200]})
Matched = (('abc',10),('cdf',20))

dist = pd.DataFrame(Matched)
dist.columns = ['address','distance']
final = Df_Address.merge(dist, how='left', on='address')
print(final)
print(final)
Output:
  address  random_stuff  distance
0     abc           100        10
1     cdf           200        20
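The parallel-processing part of the question is not covered above. Below is a minimal sketch of fanning the matching out over a worker pool. To keep it self-contained, `difflib.get_close_matches` from the standard library stands in for fuzzywuzzy's `process.extractOne` (its `cutoff=0.6` plays the role of `score_cutoff=80`), and the addresses are made up. A `ThreadPool` is used here for portability; for the real CPU-bound fuzzywuzzy scoring, `multiprocessing.Pool` with the same `map` call should spread the work across cores:

```python
import difflib
from multiprocessing.pool import ThreadPool

import pandas as pd

# Made-up stand-ins for the real CSV data
Df_Address = pd.DataFrame({"Address": ["12 Main Street", "99 Oak Avenue", "1 River Road"]})
Df_Project = pd.DataFrame({
    "Project_Id":         [1, 2],
    "Project_Start_Date": ["2020-01-01", "2020-06-01"],
    "Project_Address":    ["12 Main St", "99 Oak Ave"],
})

choices = Df_Project["Project_Address"].tolist()

def fuzzy_match(address):
    # difflib stands in for fuzzywuzzy's process.extractOne here
    hits = difflib.get_close_matches(address, choices, n=1, cutoff=0.6)
    return hits[0] if hits else None

# Fan the matching out over a pool of workers
with ThreadPool(processes=4) as pool:
    Df_Address["Project_Address"] = pool.map(fuzzy_match, Df_Address["Address"])

# Join the best match back to recover Project_Id and Project_Start_Date
final = Df_Address.merge(Df_Project, how="left", on="Project_Address")
print(final)
```

Unmatched addresses end up with NaN in the project columns, so they are easy to filter out afterwards.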

pandas Dataframe: How to reshape a single column, converting every n rows to a new column

How could I convert a dataframe like this:
import pandas as pd
A = [0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4]
dfA = pd.DataFrame(A)
to a new dataframe like this:
# Expect output:
B = [[0,1,2,3,4],[0,1,2,3,4],[0,1,2,3,4],[0,1,2,3,4]]
dfB = pd.DataFrame(B)
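No answer was recorded for this one. One possible sketch, assuming the column length is an exact multiple of n (here n = 4 consecutive rows per new column), reshapes the underlying NumPy array and transposes it:

```python
import pandas as pd

A = [0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4]
dfA = pd.DataFrame(A)

n = 4  # every n consecutive rows become one column
# reshape to (num_groups, n), then transpose so each group is a column
dfB = pd.DataFrame(dfA[0].to_numpy().reshape(-1, n).T)
print(dfB)
```

The `-1` lets NumPy infer the number of groups, so only n needs to be specified.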

How can I add values from pandas group to new Dataframe after a function?

I am trying to separate a Dataframe into groups, run each group through a function, and have the return value from the first row of each group placed into a new Dataframe.
When I try the code below, I can print out the information I want, but when I try to add it to the new Dataframe, it only shows the values for the last group.
How can I add the values from each group into the new Dataframe?
Thanks,
Here is what I have so far:
import pandas as pd
import numpy as np

# Build random dataframe
df = pd.DataFrame(np.random.randint(0, 40, size=10),
                  columns=["Random"],
                  index=pd.date_range("20200101", freq='6h', periods=10))
df["Random2"] = np.random.randint(70, 100, size=10)
df["Random3"] = 2
df.index = df.index.map(lambda t: t.strftime('%Y-%m-%d'))
df.index.name = 'Date'
df.reset_index(inplace=True)

# Set up groups by date
df = df.groupby(['Date']).apply(lambda x: x.reset_index())
df.drop(["index", "Date"], axis=1, inplace=True)

# Create new dataframe for newValue
df2 = pd.DataFrame(index=(df.index)).unstack()

# random function for an example
def any_func(df):
    df["Value"] = df["Random"] * df["Random2"] / df["Random3"]
    return df["Value"]

# loop by unique group name
for date in df.index.get_level_values('Date').unique():
    # I can print the data I want
    print(any_func(df.loc[date])[0])
    # But when I add it to a new dataframe, it only shows the value from the last group
    df2["newValue"] = any_func(df.loc[date])[0]
df2
Unrelated, but try modifying your any_func to take advantage of vectorized functions where possible.
Now if I understand you correctly:
new_value = df['Random'] * df['Random2'] / df['Random3']
df2['New Value'] = new_value.loc[:, 0]
This line of code gave me the desired outcome. I just needed to set the index using the "date" variable when I created the column, not when I created the Dataframe.
df2.loc[date, "newValue"] = any_func(df.loc[date])[0]
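The loop can also be avoided entirely: compute the value once, vectorized, then take the first row per group with groupby(...).first(). A minimal sketch with deterministic stand-in data (the original frame uses random values, so fixed numbers are used here for clarity):

```python
import pandas as pd

# Deterministic stand-in for the random example frame
df = pd.DataFrame({
    "Date":    ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
    "Random":  [10, 20, 30, 40],
    "Random2": [80, 90, 70, 100],
    "Random3": [2, 2, 2, 2],
})

# Vectorized version of any_func, applied to the whole frame at once
df["Value"] = df["Random"] * df["Random2"] / df["Random3"]

# First row's Value per Date group, collected into a new dataframe
df2 = df.groupby("Date", as_index=False)["Value"].first()
print(df2)
```

This sidesteps the original problem: there is no per-iteration assignment to overwrite, so every group's value is kept.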