Using query to combine pandas dataframes - sql

I'm working on a problem where I need to merge two dataframes and apply a condition similar to the WHERE clause in SQL. To start, I have two dataframes:
import pandas as pd

Member_Timepoints = pd.DataFrame(list(zip([1001,1001,1002,1003],['2016-09-02','2018-01-30','2018-03-17','2019-01-10'])),columns = ['Member_ID','Discharge_Date'])
Enrollment_Information = pd.DataFrame(list(zip([1001,1001,1002,1003,1003,1003,1003], ['2015-07-01','2018-01-01','2018-03-01','2017-11-01','2018-08-01','2019-07-01','2019-09-01'], ['2018-01-01','2262-04-11','2018-08-01','2018-08-01','2019-06-01','2019-08-01','2262-04-11'])), columns = ['Member_ID','Coverage_Effective_Date','Coverage_Cancel_Date'])
Member_Timepoints['Discharge_Date'] = pd.to_datetime(Member_Timepoints['Discharge_Date'])
Enrollment_Information['Coverage_Effective_Date'] = pd.to_datetime(Enrollment_Information['Coverage_Effective_Date'])
Enrollment_Information['Coverage_Cancel_Date'] = pd.to_datetime(Enrollment_Information['Coverage_Cancel_Date'])
I need to join these dataframes together on 'Member_ID' and want to use the following condition as the filter criterion:
Coverage_Effective_Date <= Discharge_Date and Coverage_Cancel_Date >= Discharge_Date + 30
I referred to Join pandas dataframes based on different conditions to start; however, I am still struggling to merge the dataframes with the above condition applied.
Can anyone please help me to implement this in pandas using query?

The first thing I've seen in this condition is datetime-plus-integer addition. You cannot add different data types; you should use a timedelta:
from datetime import timedelta
some_date_type + timedelta(days=30)
For the query part, you can use .loc after merging:
data = Enrollment_Information.merge(Member_Timepoints, on=['Member_ID'])
data.loc[(data['Coverage_Effective_Date'] <= data['Discharge_Date']) &
         (data['Coverage_Cancel_Date'] >= data['Discharge_Date'] + timedelta(days=30))]
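Since the question asks about query specifically, the same filter can also be written with DataFrame.query after the merge. A minimal sketch, where the Cutoff_Date helper column is my own addition to keep the date arithmetic out of the query string:

data = Enrollment_Information.merge(Member_Timepoints, on='Member_ID')
# Precompute Discharge_Date + 30 days so the query string stays simple.
data['Cutoff_Date'] = data['Discharge_Date'] + pd.Timedelta(days=30)
result = data.query('Coverage_Effective_Date <= Discharge_Date and Coverage_Cancel_Date >= Cutoff_Date')
result = result.drop(columns='Cutoff_Date')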

Related

Numba - how to return multiple columns ( arrays) - after group by apply

I would like to run groupby and then apply a Numba function on top of a pandas DataFrame.
This is the example:
import numba as nb
import pandas as pd

@nb.jit(nopython=True)
def my_Numba_function(arr1, arr2):
    arr1[:] = 11
    arr2[:] = 22
    return arr1, arr2

and_df = df_input_imputed.groupby(key_cols_list, as_index=True)[['col_1', 'col_2']].\
    apply(lambda x: pd.DataFrame(my_Numba_function(arr1=x['col_1'].values, arr2=x['col_2'].values)))
And instead of getting two columns in addition to my index, I am getting the results spread across a lot of columns (rows became columns).
How can I fix this ?
Thanks,
Boris
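One likely cause, as a sketch: pd.DataFrame((arr1, arr2)) is built from a tuple of arrays, and each array becomes a row, which transposes the result. Assembling the frame column by column (the wrapper name here is illustrative) should keep one row per input row:

def run_numba(x):
    # Call the jitted function, then build the frame column by column
    # so each returned array stays a column instead of becoming a row.
    a1, a2 = my_Numba_function(arr1=x['col_1'].values, arr2=x['col_2'].values)
    return pd.DataFrame({'col_1': a1, 'col_2': a2}, index=x.index)

and_df = df_input_imputed.groupby(key_cols_list, as_index=True)[['col_1', 'col_2']].apply(run_numba)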

How to return distinct rows while keeping the ordering in a query (SQL Alchemy)

I've been stuck on this for a few days now. An event can have multiple dates, and I want the query to only return the date closest to today (the next date). I have considered querying for Events and then adding a hybrid property to Event that returns the next Event Date but I believe this won't work out (such as if I want to query EventDates in a certain range).
I'm having a problem with distinct() not working as I would expect. Keep in mind I'm not a SQL expert. Also, I'm using postgres.
My query starts like this:
distance_expression = func.ST_Distance(
    cast(EventLocation.geo, Geography(srid=4326)),
    cast("SRID=4326;POINT(%f %f)" % (lng, lat), Geography(srid=4326)),
)
query = (
    db.session.query(EventDate)
    .populate_existing()
    .options(
        with_expression(
            EventDate.distance,
            distance_expression,
        )
    )
    .join(Event, EventDate.event_id == Event.id)
    .join(EventLocation, EventDate.location_id == EventLocation.id)
)
And then I have multiple filters (just showing a few as an example):
query = query.filter(EventDate.start >= datetime.utcnow())
if kwargs.get("locality_id", None) is not None:
    query = query.filter(EventLocation.locality_id == kwargs.pop("locality_id"))
if kwargs.get("region_id", None) is not None:
    query = query.filter(EventLocation.region_id == kwargs.pop("region_id"))
if kwargs.get("country_id", None) is not None:
    query = query.filter(EventLocation.country_id == kwargs.pop("country_id"))
Then I want to order by date and distance (using my query expression)
query = query.order_by(
    EventDate.start.asc(),
    distance_expression.asc(),
)
And finally I want to get distinct rows, and only return the next EventDate of an event, according to the ordering in the code block above.
query = query.distinct(Event.id)
The problem is that this doesn't work and I get a database error. This is what the generated SQL looks like:
SELECT DISTINCT ON (events.id) ST_Distance(CAST(event_locations.geo AS geography(GEOMETRY,4326)), CAST(ST_GeogFromText(%(param_1)s) AS geography(GEOMETRY,4326))) AS "ST_Distance_1", event_dates.id AS event_dates_id, event_dates.created_at AS event_dates_created_at, event_dates.event_id AS event_dates_event_id, event_dates.tz AS event_dates_tz, event_dates.start AS event_dates_start, event_dates."end" AS event_dates_end, event_dates.start_naive AS event_dates_start_naive, event_dates.end_naive AS event_dates_end_naive, event_dates.location_id AS event_dates_location_id, event_dates.description AS event_dates_description, event_dates.description_attribute AS event_dates_description_attribute, event_dates.url AS event_dates_url, event_dates.ticket_url AS event_dates_ticket_url, event_dates.cancelled AS event_dates_cancelled, event_dates.size AS event_dates_size
FROM event_dates JOIN events ON event_dates.event_id = events.id JOIN event_locations ON event_dates.location_id = event_locations.id
WHERE events.hidden = false AND event_dates.start >= %(start_1)s AND (event_locations.lat BETWEEN %(lat_1)s AND %(lat_2)s OR false) AND (event_locations.lng BETWEEN %(lng_1)s AND %(lng_2)s OR false) AND ST_DWithin(CAST(event_locations.geo AS geography(GEOMETRY,4326)), CAST(ST_GeogFromText(%(param_2)s) AS geography(GEOMETRY,4326)), %(ST_DWithin_1)s) ORDER BY event_dates.start ASC, ST_Distance(CAST(event_locations.geo AS geography(GEOMETRY,4326)), CAST(ST_GeogFromText(%(param_3)s) AS geography(GEOMETRY,4326))) ASC
I've tried a lot of different things and orderings but I can't work this out. I've also tried to create a subquery at the end using from_self() but it doesn't keep the ordering.
Any help would be much appreciated!
EDIT:
On further experimentation, it seems that order_by will only work if it orders on the same field that I'm using for distinct(). So
query = query.order_by(EventDate.event_id).distinct(EventDate.event_id)
will work, but
query.order_by(EventDate.start).distinct(EventDate.event_id)
will not :/
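For context, Postgres requires the DISTINCT ON expression to be the leftmost ORDER BY expression, which is why only the first variant works. A sketch that satisfies that rule (note it orders per event first, so the overall result ordering changes):

query = (
    query.order_by(
        EventDate.event_id,         # must lead the ORDER BY to match DISTINCT ON
        EventDate.start.asc(),      # then choose the earliest upcoming date per event
        distance_expression.asc(),
    )
    .distinct(EventDate.event_id)
)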
I solved this by adding a row_number column and then filtering by the first row number, as in this answer:
filter by row_number in sqlalchemy
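For reference, a minimal sketch of that row_number approach in SQLAlchemy, building on the query above; the subquery structure and names are my own:

from sqlalchemy import func
from sqlalchemy.orm import aliased

# Number each event's dates by the desired ordering.
row_number = func.row_number().over(
    partition_by=EventDate.event_id,
    order_by=(EventDate.start.asc(), distance_expression.asc()),
).label("row_number")

subq = query.add_columns(row_number).subquery()
next_event_date = aliased(EventDate, subq)

# Keep only the first (next) date per event.
next_dates = db.session.query(next_event_date).filter(subq.c.row_number == 1)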

Self joining columns from the same table with calculation on one column not displaying column name

I am fairly new to SQL and am having issues figuring out how to solve the simple problem below. I have a dataset I am trying to self-join, and I am using (b.calendar_year_number - 1) as one of the columns to join on. I applied the -1 calculation to match values from the previous year. However, it is not working: the resulting column shows (No column name), as in the screenshot attached below. How do I change the alias to b.calendar_year_number after the calculation?
Code:
SELECT a.day_within_fiscal_period,
a.calendar_month_name,
a.cost_period_rolling_three_month_start_date,
a.calendar_year_number,
b.day_within_fiscal_period,
b.calendar_month_name,
b.cost_period_rolling_three_month_start_date,
(b.calendar_year_number -1)
FROM [data_mart].[v_dim_date_consumer_complaints] AS a
JOIN [data_mart].[v_dim_date_consumer_complaints] AS b
ON b.day_within_fiscal_period = a.day_within_fiscal_period AND
b.calendar_month_name = a.calendar_month_name AND
b.calendar_year_number = a.calendar_year_number
I am using (b.calendar_year_number -1) as one of the columns to join.
Nope, you're not. Look at your join statement and you'll see the third condition is:
b.calendar_year_number = a.calendar_year_number
So just change that to include the calculation. As for the 'no column name' issue, you can use the colname = somelogic syntax or somelogic as colname. Below, I used the former syntax.
select a.day_within_fiscal_period,
a.calendar_month_name,
a.cost_period_rolling_three_month_start_date,
a.calendar_year_number,
b.day_within_fiscal_period,
b.calendar_month_name,
b.cost_period_rolling_three_month_start_date,
bCalYearNum = b.calendar_year_number
from [data_mart].[v_dim_date_consumer_complaints] a
left join [data_mart].[v_dim_date_consumer_complaints] b
on b.day_within_fiscal_period = a.day_within_fiscal_period
and b.calendar_month_name = a.calendar_month_name
and b.calendar_year_number - 1 = a.calendar_year_number;
You could use the analytic functions LAG/LEAD to get your required result; no self-join necessary:
select a.day_within_fiscal_period,
a.calendar_month_name,
a.cost_period_rolling_three_month_start_date,
a.calendar_year_number,
old_cost_period_rolling_three_month_start_date =
LAG(cost_period_rolling_three_month_start_date) OVER
(PARTITION BY calendar_month_name, day_within_fiscal_period
ORDER BY calendar_year_number),
old_CalYearNum = LAG(calendar_year_number) OVER
(PARTITION BY calendar_month_name, day_within_fiscal_period
ORDER BY calendar_year_number)
from [data_mart].[v_dim_date_consumer_complaints] a

Is there a way to do a true sql type merge of 2 dataframes

For starters, I'll admit that I'm quite new to dataframes/Databricks, having worked with them for only a few months.
I have two dataframes read from parquet files (full format). In reviewing the documentation, it appears that what pandas calls merge is in fact only a join.
In SQL I would write this step as:
ml_RETURNS_U = sqlContext.sql("""
MERGE INTO U2 as target
USING U as source
ON (
target.ITEMNUMBER = source.ITEMNUMBER
and target.PRODUCTCOLORID = source.PRODUCTCOLORID
and target.WEEK_ID = source.WEEK_ID
)
WHEN MATCHED THEN
UPDATE SET target.RETURNSALESQUANTITY = target.RETURNSALESQUANTITY + source.QTY_DELIVERED
WHEN NOT MATCHED THEN
INSERT (ITEMNUMBER, PRODUCTCOLORID, WEEK_ID, RETURNSALESQUANTITY)
VALUES (source.ITEMNUMBER, source.PRODUCTCOLORID, source.WEEK_ID, source.QTY_DELIVERED)
""")
When I run this command I get the following error: u'MERGE destination only supports Delta sources.\n;'
So I have two questions: is there a way I can perform this operation using pandas or PySpark?
If not, how can I resolve this error?
You can create your tables using Delta and perform this operation; see: https://docs.databricks.com/delta/index.html
Then you can do an upsert using MERGE as described here: https://docs.databricks.com/delta/delta-batch.html#write-to-a-table
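If the target is rewritten as a Delta table, here is a minimal sketch of the equivalent upsert with the Delta Lake Python API; the table path is illustrative:

from delta.tables import DeltaTable

# Assumes U2 has been saved as a Delta table; the path is illustrative.
target = DeltaTable.forPath(spark, "/mnt/delta/U2")
target.alias("target").merge(
    U.alias("source"),
    "target.ITEMNUMBER = source.ITEMNUMBER"
    " AND target.PRODUCTCOLORID = source.PRODUCTCOLORID"
    " AND target.WEEK_ID = source.WEEK_ID"
).whenMatchedUpdate(
    set={"RETURNSALESQUANTITY": "target.RETURNSALESQUANTITY + source.QTY_DELIVERED"}
).whenNotMatchedInsert(
    values={
        "ITEMNUMBER": "source.ITEMNUMBER",
        "PRODUCTCOLORID": "source.PRODUCTCOLORID",
        "WEEK_ID": "source.WEEK_ID",
        "RETURNSALESQUANTITY": "source.QTY_DELIVERED",
    }
).execute()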

SQL IN and AND clause output

I have written one small query, shown below, and it gives output:
select user_id
from table tf
where tf.function_id in ('1001051','1001060','1001061')
but when I run the query below it shows 0 output. However, I have verified manually that we have user_ids where all 3 function_ids are present.
select user_id
from table tf
where tf.function_id='1001051'
and
tf.function_id='1001060'
and
tf.function_id='1001061'
It looks very simple to use an AND clause; however, I am not getting the desired output. Am I doing something wrong?
Thanks in advance
Is this what you want to do?
select tf.user_id
from table tf
where tf.function_id in ('1001051', '1001060', '1001061')
group by tf.user_id
having count(distinct tf.function_id) = 3;
This returns users that have all three functions.
EDIT:
This is the query in your comment:
select tu.dealer_id, tu.usr_alias, tf.function_nm
from t_usr tu, t_usr_function tuf, t_function tf
where tu.usr_id = tuf.usr_id and tuf.function_id = tf.function_id and
tf.function_id = '1001051' and tf.function_id = '1001060' and tf.function_id = '1001061' ;
First, you should learn proper join syntax. Simple rule: Never use commas in the from clause.
I think the query you want is:
select tu.dealer_id, tu.usr_alias
from t_usr tu join
t_usr_function tuf
on tu.usr_id = tuf.usr_id
where tuf.function_id in ('1001051', '1001060', '1001061')
group by tu.dealer_id, tu.usr_alias
having count(distinct tuf.function_id) = 3;
This doesn't give you the function name. I'm not sure why you need such detail if all three functions are there for each "user" (or at least dealer/user alias combination). And, the original question doesn't request this level of detail.
Using the AND clause means that the query should satisfy all of the conditions.
In your case, you need to return a row when function_id='1001051' OR function_id='1001060' OR function_id='1001061'.
So in brief, you need to replace the AND with OR.
select user_id from table tf
where tf.function_id='1001051' OR tf.function_id='1001060' OR tf.function_id='1001061'
That's what IN does: it matches any of them.
As I pointed out in the comment, AND is not the right operator, since all three conditions can never be met together. Use OR instead:
select user_id from table tf
where tf.function_id='1001051' OR tf.function_id='1001060' OR tf.function_id='1001061'
You're asking for the value to be three different values at the same time. Use OR instead of AND:
select user_id from table tf
where tf.function_id='1001051' or tf.function_id='1001060' or tf.function_id='1001061'
If all of these things are true:
tf.function_id='1001051'
tf.function_id='1001060'
tf.function_id='1001061'
Then simple algebra tells us this must also be true:
'1001051'='1001060'='1001061'
Since that clearly can't ever be true, your SQL statement's where clause will always resolve to false.
What you want to say is that any of those conditions is true (which is equivalent to in), which means you need to use or:
SELECT user_id
FROM table tf
WHERE tf.function_id = '1001051'
OR tf.function_id = '1001060'
OR tf.function_id = '1001061'
The where clause applies to each row returned by the query. In order to gather data across rows, you either need to join the table to itself enough times to create a single row that satisfies the condition you're looking for or use aggregate functions to consolidate several rows into a single row.
Self-join solution:
SELECT user_id
FROM table tf1
JOIN table tf2 ON tf1.user_id = tf2.user_id
JOIN table tf3 ON tf1.user_id = tf3.user_id
WHERE tf1.function_id = '1001051'
AND tf2.function_id = '1001060'
AND tf3.function_id = '1001061'
Aggregate solution:
SELECT user_id
FROM table tf
WHERE tf.function_id IN ('1001051', '1001060', '1001061')
GROUP BY user_id
HAVING COUNT (DISTINCT tf.function_id) = 3
Try this, per the link SQL IN:
select function_id, user_id from table tf
where tf.function_id in ('1001051','1001060','1001061')