Excluding IDs with some value day after day - pandas

I have this df:
ID Date X Y
A 16-07-19 123 56
A 17-07-19 456 84
A 18-07-19 0 58
A 19-07-19 123 81
B 19-07-19 456 70
B 21-07-19 789 46
B 22-07-19 0 19
B 23-07-19 0 91
C 14-07-19 0 86
C 16-07-19 456 91
C 17-07-19 456 86
C 18-07-19 0 41
C 19-07-19 456 26
C 20-07-19 456 17
D 06-07-19 789 98
D 08-07-19 789 90
D 09-07-19 0 94
I want to exclude every ID that has a non-zero value in column X on two consecutive days.
For example: A has the value 123 on 16-07-19 and 456 on 17-07-19, so all of A's observations should be excluded.
Expected result:
ID Date X Y
B 19-07-19 456 70
B 21-07-19 789 46
B 22-07-19 0 19
B 23-07-19 0 91
D 06-07-19 789 98
D 08-07-19 789 90
D 09-07-19 0 94

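Let's do this in a vectorized manner to keep the code efficient (that is, we avoid GroupBy.apply). This assumes the rows are sorted by Date within each ID, as in the sample.
First, we check whether the difference in Date from the previous row of the same ID equals 1 day.
Then we check that X is not 0 on both that row and the previous one.
We create a temporary column m that is True where both conditions hold.
Finally, we group by ID and remove every group in which any row is True.
# if Date is not already datetime; the sample dates are day-month-year
# df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%y')
g = df.groupby('ID')
m1 = g['Date'].diff().eq(pd.Timedelta(1, unit='d'))   # one day after the previous row of the same ID
m2 = df['X'].ne(0) & g['X'].shift().fillna(0).ne(0)   # X is non-zero on both days
df['m'] = m1 & m2
df = df[~df.groupby('ID')['m'].transform('any')].drop(columns='m')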
ID Date X Y
4 B 2019-07-19 456 70
5 B 2019-07-21 789 46
6 B 2019-07-22 0 19
7 B 2019-07-23 0 91
14 D 2019-07-06 789 98
15 D 2019-07-08 789 90
16 D 2019-07-09 0 94
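One detail worth checking is whether the day difference is taken per ID or over the whole column, because a whole-column diff also compares the first row of one ID with the last row of the previous ID. Below is a small sketch on made-up data (the three rows are hypothetical, not from the question) in which no ID has two consecutive days, so nothing should be dropped:

```python
import pandas as pd

# Hypothetical mini-example: no ID has two consecutive days here.
df = pd.DataFrame({
    "ID": ["A", "B", "B"],
    "Date": pd.to_datetime(["01-07-19", "02-07-19", "05-07-19"], format="%d-%m-%y"),
    "X": [5, 7, 0],
})

one_day = pd.Timedelta(days=1)

# Whole-column diff: B's first row is compared with A's last row.
global_flag = df["Date"].diff().eq(one_day) & df["X"].ne(0)

# Per-ID diff: the first row of every ID gets NaT, so it is never flagged.
group_flag = df.groupby("ID")["Date"].diff().eq(one_day) & df["X"].ne(0)

kept_global = df[~global_flag.groupby(df["ID"]).transform("any")]
kept_grouped = df[~group_flag.groupby(df["ID"]).transform("any")]

print(kept_global["ID"].tolist())   # B is dropped by mistake
print(kept_grouped["ID"].tolist())  # both IDs are kept
```

On the question's sample data the two variants happen to give the same result, so the difference only shows up when one ID ends the day before the next ID starts.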

Related

Selecting rows where values in one column are different, other column(s) values are similar and values of Date columns are correct

Suppose I have the following columns:
ID,Code,DST,Short_text,Long_text,Date_from,Date_until
Here is the dataset:
ID Code DST Short_text Long_text Date_From Date_Until
1 B 01 B 1 Bez1 Bezirk1 29.10.1999 13.01.2020
1 B 01 B 1 Bez1 Bezirk1 14.01.2020 31.12.9999
2 B 02 B 2 Bez2 Bezirk2 29.10.1999 13.01.2020
3 B 03 B 3 Bez3 Bezirk3 14.01.2020 31.12.9999
4 B 04 B 4 Bez4 Bezirk4 29.10.1999 13.01.2020
4 B 04 B 4 Bez4 Bezirk4 14.01.2020 31.12.9999
97 M 51 M 52 MA 51 Sport 29.10.1999 13.01.2020
96 M 51 M 51 MA 51 Sport 14.01.2020 31.12.9999
98 M 55 M 53 MA 53 Dance 29.10.1999 13.01.2020
99 M 55 M 54 MA 54 Skating 14.01.2020 31.12.9999
100 M 56 M 59 MA 57 Football 29.10.1999 13.01.2020
101 M 56 M 56 MA 56 Tennis 29.10.1999 31.12.9999
I want to select rows that have different IDs AND (similar Code OR similar Short_text OR similar Long_text) AND a correct Date_From - Date_Until range.
Definition of a correct Date_From - Date_Until range:
1. Date_From < Date_Until
2. Both fields are not null
3. WHEN PREV_DATE_UNTIL = DATE_FROM - 1 OR PREV_DATE_UNTIL is null THEN 'OK' (PREV_DATE_UNTIL using the lag operator)
4. WHEN NEXT_DATE_FROM = DATE_UNTIL + 1 OR NEXT_DATE_FROM is null THEN 'OK' (NEXT_DATE_FROM using the lead operator)
Not correct:
WHEN NEXT_DATE_FROM > DATE_UNTIL + 1 THEN 'Gaps in Dates'
WHEN NEXT_DATE_FROM < DATE_UNTIL + 1 THEN 'Overlapping dates'
Basically, what I mean is that the historization of the data must be correct (no overlaps).
At the end I want to select the following rows:
97 M 51 M 52 MA 51 Sport 29.10.1999 13.01.2020
96 M 51 M 51 MA 51 Sport 14.01.2020 31.12.9999
Because they have different ID and similar Code or short_text or long_text and dates are correct according to the definition
And
98 M 55 M 53 MA 53 Dance 29.10.1999 13.01.2020
99 M 55 M 54 MA 54 Skating 14.01.2020 31.12.9999
Because they have different ID and similar Code and dates are correct according to the definition
Rows:
100 M 56 M 59 MA 57 Football 29.10.1999 13.01.2020
101 M 56 M 56 MA 56 Tennis 29.10.1999 31.12.9999
Should NOT be selected, because they have different ID and similar Code BUT they have incorrect Dates(they are overlapping).
Something like this should work:
with t as (
    select row_number() over (order by date_from, date_until) rn,
           id, code, dst, short_text, long_text, date_from,
           nullif(date_until, date '9999-12-31') date_until
    from data)
select rna, rnb, description, t.*
from t
join (
    select a.rn rna, b.rn rnb,
           case when b.date_from = a.date_until + 1 then 'OK'
                when b.date_from > a.date_until + 1 then 'gaps'
                when b.date_from < a.date_until + 1 then 'overlapping'
           end description
    from t a
    join t b on a.id <> b.id and a.rn < b.rn
            and (a.code = b.code or a.short_text = b.short_text
                 or a.long_text = b.long_text)) pairs
  on rn in (rna, rnb)
result is:
RNA RNB DESCRIPTION RN ID CODE DST SHORT_TEXT LONG_TEXT DATE_FROM DATE_UNTIL
------ ------ ----------- ----- ---------- ---- ---- ---------- --------- ----------- -----------
1 7 overlapping 1 100 M 56 M 59 MA 57 Football 1999-10-29 2020-01-13
1 7 overlapping 7 101 M 56 M 56 MA 56 Tennis 1999-10-29
3 8 OK 3 98 M 55 M 53 MA 53 Dance 1999-10-29 2020-01-13
3 8 OK 8 99 M 55 M 54 MA 54 Skating 2020-01-14
6 9 OK 6 97 M 51 M 52 MA 51 Sport 1999-10-29 2020-01-13
6 9 OK 9 96 M 51 M 51 MA 51 Sport 2020-01-14
dbfiddle
I numbered the rows, self-joined the data, and expressed your logic as a CASE WHEN expression. I tested it on your examples; if you find any mistakes, please provide a dbfiddle if possible.
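To make the three outcomes of the CASE expression concrete, here is a minimal Python sketch of the same comparison for a single pair of periods. The function name `classify` is made up for illustration, and treating a missing next period as 'OK' follows rule 4 of the question rather than the SQL above:

```python
from datetime import date, timedelta

def classify(date_until, next_date_from):
    """Compare one period's end with the start of the following period."""
    if next_date_from is None:
        return "OK"  # rule 4 in the question: no following period
    expected = date_until + timedelta(days=1)
    if next_date_from == expected:
        return "OK"
    return "gaps" if next_date_from > expected else "overlapping"

# IDs 97/96 from the question: the second period starts the day after the first ends.
print(classify(date(2020, 1, 13), date(2020, 1, 14)))
# IDs 100/101: the second period starts long before the first one ends.
print(classify(date(2020, 1, 13), date(1999, 10, 29)))
```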

how to concat corresponding rows value to make column name in pandas?

I have the dataframe below, which is in a messy state. I need to combine rows 0 and 1 into the column names and keep the remaining rows as they are:
Start Date 2005-01-01 Unnamed: 3 Unnamed: 4 Unnamed: 5
Dat an_1 an_2 an_3 an_4 an_5
mt mt s t inch km
23 45 67 78 89 9000
Change it to the dataframe below:
Dat_mt an_1_mt an_2_s an_3_t an_4_inch an_5_km
23 45 67 78 89 9000
IIUC
df.columns = df.loc[0] + '_' + df.loc[1]   # combine row 0 and row 1 into the header
df = df.loc[2:]                            # keep every row after the two header rows
df
Out[429]:
Dat_mt an_1_mt an_2_s an_3_t an_4_inch an_5_km
2 23 45 67 78 89 9000
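For reference, a self-contained sketch of the same idea, assuming the frame was read without a usable header so that row 0 holds the names and row 1 the units (the frame is rebuilt by hand here and the variable names `header` and `clean` are made up):

```python
import pandas as pd

# The messy frame from the question: row 0 holds names, row 1 holds units.
df = pd.DataFrame([
    ["Dat", "an_1", "an_2", "an_3", "an_4", "an_5"],
    ["mt", "mt", "s", "t", "inch", "km"],
    [23, 45, 67, 78, 89, 9000],
])

# Build the header from the first two rows; cast to str in case of mixed types.
header = df.iloc[0].astype(str) + "_" + df.iloc[1].astype(str)

# Keep every row after the two header rows and renumber them from 0.
clean = df.iloc[2:].reset_index(drop=True)
clean.columns = header.tolist()

print(clean.columns.tolist())
```

The data columns stay as object dtype after this; converting them to numbers is a separate step if needed.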

SQL query between and equals

There are three tables. The first, baseline, contains all beneficiaries' information and has a PPI_SCORE column. The second, PPI_SCORE_TOOKUP, contains the six columns shown below. The third, endline, contains the beneficiaries' endline assessment data and also has a PPI_SCORE column.
I want to join these tables. However, PPI_SCORE_TOOKUP has no foreign key to the baseline or endline table; the only link between them is the PPI score.
The query should show some baseline data along with the PPI result when the PPI score in the baseline table is between PPI_SCORE_START and PPI_SCORE_END, inclusive (ppi_start and ppi_end in the table below). In the same row, it should also show the endline data of the same member together with the six PPI_SCORE_TOOKUP columns when the PPI score in the endline table is between PPI_SCORE_START and PPI_SCORE_END, inclusive.
Note: I have not tried any query yet because I have no idea how to do this; the expected result is at the bottom of this question.
Tables are as follows
baseline table
ID NAME LAST_NAME DISTRICT PPI_SCORE
1 A A A 10
2 B B B 23
3 C C C 90
4 D D D 47
endline table
baseline_ID Enterprise Market PPI_SCORE
3 Bee Keeping Yes
2 Poultry No 74
1 Agriculture Yes 80
PPI_SCORE_TOOKUP table
ppi_start ppi_end national national_150 national_200 usaid
0 4 100 100 100 100
10 14 66.1 89.5 96.5 39.2
5 9 68.8 90.2 96.7 44.4
15 19 59.5 89.1 97.2 35.2
20 24 51.3 85.5 96.4 28.8
25 29 43.5 81.1 93.2 20
30 34 31.9 74.5 90.4 13.6
35 39 24.6 66.9 87.3 7.9
40 44 15.2 58 82.8 4.5
45 49 11.4 47.9 73.4 4.2
50 54 6 37.2 68.4 2.6
55 59 2.7 26.1 61.3 0.5
60 64 0.9 21 50.4 0.5
65 69 0 14.3 37.1 0
70 74 3 14.3 29.2 0
75 79 0 1.4 5.1 0
80 84 0 0 9.5 0
85 89 0 0 15.2 0
90 94 0 0 0 0
95 100 0 0 0 0
Expected Result
Your query can be written as follows:
SELECT *
FROM baseline b
LEFT JOIN endline e ON b.id = e.baseline_ID
LEFT JOIN PPI_SCORE_TOOKUP ppi ON b.PPI_SCORE BETWEEN ppi.ppi_start AND ppi.ppi_end
LEFT JOIN PPI_SCORE_TOOKUP ppi2 ON e.PPI_SCORE BETWEEN ppi2.ppi_start AND ppi2.ppi_end
This matches the id values from the baseline table with the baseline_ID values from the endline table; because it is a LEFT JOIN, baseline rows without an endline match are kept, with NULLs in the endline columns. It then matches PPI_SCORE from baseline against the ppi_start to ppi_end range in PPI_SCORE_TOOKUP, and does the same for PPI_SCORE from endline using a second alias of the lookup table.
Replace * with whichever fields you need.
See fiddle for a working example
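If you want to try the range join without a database server, here is a rough sketch using Python's built-in sqlite3 module. It loads only a few of the lookup rows and only the national column, and it assumes the missing endline PPI_SCORE for baseline_ID 3 is NULL:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE baseline (id INTEGER, name TEXT, ppi_score INTEGER);
CREATE TABLE endline (baseline_id INTEGER, enterprise TEXT, ppi_score INTEGER);
CREATE TABLE ppi_score_tookup (ppi_start INTEGER, ppi_end INTEGER, national REAL);
INSERT INTO baseline VALUES (1, 'A', 10), (2, 'B', 23), (3, 'C', 90), (4, 'D', 47);
-- Assumption: the blank endline score for baseline_ID 3 is NULL.
INSERT INTO endline VALUES (3, 'Bee Keeping', NULL), (2, 'Poultry', 74), (1, 'Agriculture', 80);
-- Only the lookup ranges that the sample scores fall into.
INSERT INTO ppi_score_tookup VALUES
    (10, 14, 66.1), (20, 24, 51.3), (45, 49, 11.4),
    (70, 74, 3), (80, 84, 0), (90, 94, 0);
""")

rows = con.execute("""
SELECT b.id, b.ppi_score, p1.national, e.ppi_score, p2.national
FROM baseline b
LEFT JOIN endline e ON b.id = e.baseline_id
LEFT JOIN ppi_score_tookup p1 ON b.ppi_score BETWEEN p1.ppi_start AND p1.ppi_end
LEFT JOIN ppi_score_tookup p2 ON e.ppi_score BETWEEN p2.ppi_start AND p2.ppi_end
ORDER BY b.id
""").fetchall()

for row in rows:
    print(row)
```

Each baseline member comes back exactly once, and members with no endline row or no endline score get NULL (None in Python) in the endline columns.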

SQL WHERE clause not filtering out null column

I don't understand why I'm getting certain results when I run a SQL query. This is the query:
SELECT A.flag, B.type, B.aID
FROM A
LEFT JOIN B ON B.aID = A.aID
WHERE A.startDate = '2013-01-07'
AND (A.flag = 1 OR B.type IS NOT NULL)
aID is the primary key on table A.
This is the result I get:
flag type aID
---- ---- ----
0 NULL NULL
I would have expected there to be no results. I am confused because A.flag is not 1 and B.type is null, which seems contrary to my WHERE clause. Note that there was no match for this row on table B, since B.aID is null in the result. When I run the query with only one of A.flag = 1 and B.type IS NOT NULL instead of both, no results are returned instead of one result.
Curiously, when I replace B.type in the SELECT statement with ISNULL(B.type, 'X'), no results are returned. The same happens when I add AND B.type IS NOT NULL to the LEFT JOIN.
Why am I getting this result?
Edit: Sample data
Here is a query that gets rows from table A using two different start dates:
SELECT * FROM A
WHERE A.startDate IN ('2013-01-07', '2012-11-23')
I get the following results (leaving out 6 columns for clarity):
aID cID sID psID startDate flag
------- ---- ---- ---- ----------------------- -----
23844 75 72 86 2013-01-07 00:00:00 0
23940 75 72 86 2012-11-23 00:00:00 1
21061 76 73 87 2012-11-23 00:00:00 0
21293 76 74 88 2012-11-23 00:00:00 0
21477 77 75 89 2012-11-23 00:00:00 0
21711 78 76 90 2012-11-23 00:00:00 0
21944 79 77 91 2012-11-23 00:00:00 0
22176 80 78 92 2012-11-23 00:00:00 0
22410 81 79 93 2012-11-23 00:00:00 0
22643 82 80 94 2012-11-23 00:00:00 0
23344 83 81 95 2012-11-23 00:00:00 0
22876 84 82 96 2012-11-23 00:00:00 0
23639 85 83 97 2012-11-23 00:00:00 0
23109 89 84 98 2012-11-23 00:00:00 0
(14 row(s) affected)
Using the aID that we found for 2013-01-07, we can see from this next query that there is no entry in table B that corresponds to that start date.
SELECT * FROM B
WHERE B.aID = 23844
This returns no results.
Using the aID's that we found for 2012-11-23, we can see that all but one of these have a corresponding entry in table B.
SELECT * FROM B
WHERE B.aID IN (23940,
21061,
21293,
21477,
21711,
21944,
22176,
22410,
22643,
23344,
22876,
23639,
23109)
Results:
bID aID type duration iMinutes
----- ------ ----- -------- ---------
5836 21061 M 0 0
5893 21293 M 0 0
5916 21477 M 0 0
5975 21711 M 0 0
6033 21944 M 0 0
6092 22176 M 0 0
6150 22410 M 0 0
6208 22643 M 0 0
6266 22876 M 0 0
6530 23109 M 0 0
6382 23344 M 0 0
6478 23639 M 0 0
(12 row(s) affected)
Turns out this is happening because there are no service packs installed on this database (2008). Other databases with SP1 installed do not have this issue.
It's because you are doing a LEFT (OUTER) JOIN. Perhaps an INNER JOIN is what you had in mind?
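For comparison, here is a minimal sketch of what the query is expected to return under standard SQL semantics. It runs on SQLite through Python's sqlite3 module rather than on SQL Server, and it uses only three rows of A and one row of B from the sample data, so treat it as an illustration of the expected behaviour, not a reproduction of the issue:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE A (aID INTEGER PRIMARY KEY, startDate TEXT, flag INTEGER);
CREATE TABLE B (bID INTEGER PRIMARY KEY, aID INTEGER, type TEXT);
INSERT INTO A VALUES (23844, '2013-01-07', 0), (23940, '2012-11-23', 1),
                     (21061, '2012-11-23', 0);
INSERT INTO B VALUES (5836, 21061, 'M');
""")

query = """
SELECT A.flag, B.type, B.aID
FROM A
LEFT JOIN B ON B.aID = A.aID
WHERE A.startDate = ?
  AND (A.flag = 1 OR B.type IS NOT NULL)
"""

# The WHERE clause is applied after the join, so the unmatched row with flag 0 is dropped.
rows_2013 = con.execute(query, ("2013-01-07",)).fetchall()
rows_2012 = con.execute(query, ("2012-11-23",)).fetchall()
print(rows_2013)
print(rows_2012)
```

The LEFT JOIN keeps the unmatched row from A only until the WHERE clause is evaluated; a row with flag 0 and a NULL type fails both conditions and should not appear in the result.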

How to vertically flip data in BQL?

As in the title. For example, I have the data below:
Key1 Key2 Cost Qty_LIFO Date
Red A 2 19 1/4/2018
Red A 3 18 1/3/2018
Red C 4 7 1/2/2018
Red A 5 16 1/1/2018
Blu B 21 91 1/4/2018
Blu B 31 81 1/3/2018
Blu D 41 70 1/2/2018
Blu D 51 60 1/1/2018
The goal is to transform the data to look like below. Flip the quantity column, while also taking into account the Keys/categories
Key1 Key2 Cost Qty_FIFO Date
Red A 2 16 1/4/2018
Red A 3 18 1/3/2018
Red C 4 7 1/2/2018
Red A 5 19 1/1/2018
Blu B 21 81 1/4/2018
Blu B 31 91 1/3/2018
Blu D 41 60 1/2/2018
Blu D 51 70 1/1/2018
or like this (Qty_FIFO is flipped and added to the first example at the top):
Key1 Key2 Cost Qty_LIFO Qty_FIFO Date
Red A 2 19 16 1/4/2018
Red A 3 18 18 1/3/2018
Red C 4 7 7 1/2/2018
Red A 5 16 19 1/1/2018
Blu B 21 91 81 1/4/2018
Blu B 31 81 91 1/3/2018
Blu D 41 70 60 1/2/2018
Blu D 51 60 70 1/1/2018
The purpose of this is to calculate LIFO and FIFO costs.
I need to take Qty_LIFO column (which is sorted by date, descending), flip it vertically (so the data becomes Date ASC), and re-add it to the table without changing the sorting of the Costs column.
Basically, I need to pair the newest Cost data with the oldest Qty data and continue from there.
This is a hack, which only works if you have access to row_number()
CREATE TABLE existing_qry(
Key1 VARCHAR(3) NOT NULL
,Key2 VARCHAR(1) NOT NULL
,Cost INTEGER NOT NULL
,Qty_LIFO INTEGER NOT NULL
,Date DATE NOT NULL
);
INSERT INTO existing_qry(Key1,Key2,Cost,Qty_LIFO,Date) VALUES ('Red','A',2,19,'1/4/2018');
INSERT INTO existing_qry(Key1,Key2,Cost,Qty_LIFO,Date) VALUES ('Red','A',3,18,'1/3/2018');
INSERT INTO existing_qry(Key1,Key2,Cost,Qty_LIFO,Date) VALUES ('Red','C',4,7,'1/2/2018');
INSERT INTO existing_qry(Key1,Key2,Cost,Qty_LIFO,Date) VALUES ('Red','A',5,16,'1/1/2018');
INSERT INTO existing_qry(Key1,Key2,Cost,Qty_LIFO,Date) VALUES ('Blu','B',21,91,'1/4/2018');
INSERT INTO existing_qry(Key1,Key2,Cost,Qty_LIFO,Date) VALUES ('Blu','B',31,81,'1/3/2018');
INSERT INTO existing_qry(Key1,Key2,Cost,Qty_LIFO,Date) VALUES ('Blu','D',41,70,'1/2/2018');
INSERT INTO existing_qry(Key1,Key2,Cost,Qty_LIFO,Date) VALUES ('Blu','D',51,60,'1/1/2018');
with cte as (
    select
          *
        , row_number() over(partition by key1 order by date ASC) rn_asc
        , row_number() over(partition by key1 order by date DESC) rn_desc
    from existing_qry
    )
select
    t.Key1, t.Key2, t.Cost, t.Qty_LIFO, flip.Qty_LIFO as Qty_FIFO, t.Date, t.rn_asc, t.rn_desc
from cte as t
inner join cte as flip on t.key1 = flip.key1 and t.rn_asc = flip.rn_desc
It calculates two row numbers for each row, one in ascending and one in descending date order, then aligns rows through a self join that requires these numbers to be equal. This has the effect of reversing the LIFO numbers (or "flipping" that column).
   Key1 Key2 Cost Qty_LIFO Qty_FIFO Date                rn_asc rn_desc
   ---- ---- ---- -------- -------- ------------------- ------ -------
1  Blu  B    21   91       60       04.01.2018 00:00:00 4      1
2  Blu  B    31   81       70       03.01.2018 00:00:00 3      2
3  Blu  D    41   70       81       02.01.2018 00:00:00 2      3
4  Blu  D    51   60       91       01.01.2018 00:00:00 1      4
5  Red  A    2    19       16       04.01.2018 00:00:00 4      1
6  Red  A    3    18       7        03.01.2018 00:00:00 3      2
7  Red  C    4    7        18       02.01.2018 00:00:00 2      3
8  Red  A    5    16       19       01.01.2018 00:00:00 1      4
https://rextester.com/LQBVD29253
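Note that the result above flips within Key1 only, while the expected output in the question flips within each Key1/Key2 pair. Here is a sketch of that variant, run on SQLite through Python's sqlite3 module as a stand-in for the real database; it assumes window function support (SQLite 3.25 or later) and stores the dates as ISO text so that they sort correctly:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE existing_qry (Key1 TEXT, Key2 TEXT, Cost INTEGER, Qty_LIFO INTEGER, Date TEXT);
INSERT INTO existing_qry VALUES
    ('Red','A',2,19,'2018-01-04'),  ('Red','A',3,18,'2018-01-03'),
    ('Red','C',4,7,'2018-01-02'),   ('Red','A',5,16,'2018-01-01'),
    ('Blu','B',21,91,'2018-01-04'), ('Blu','B',31,81,'2018-01-03'),
    ('Blu','D',41,70,'2018-01-02'), ('Blu','D',51,60,'2018-01-01');
""")

rows = con.execute("""
WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Key1, Key2 ORDER BY Date ASC)  AS rn_asc,
           ROW_NUMBER() OVER (PARTITION BY Key1, Key2 ORDER BY Date DESC) AS rn_desc
    FROM existing_qry
)
SELECT t.Key1, t.Key2, t.Cost, t.Qty_LIFO, flip.Qty_LIFO AS Qty_FIFO, t.Date
FROM cte AS t
JOIN cte AS flip
  ON t.Key1 = flip.Key1 AND t.Key2 = flip.Key2 AND t.rn_asc = flip.rn_desc
ORDER BY t.Key1 DESC, t.Date DESC
""").fetchall()

for row in rows:
    print(row)
```

With both keys in the partition, the Qty_FIFO column matches the second expected table in the question: a group with a single row, such as Red/C, keeps its own quantity.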