Pandas - cross columns reference

My data is a bit complicated, so I will separate this into two sections: (A) explaining the data, (B) the desired output.
(A) - Explaining the data:
My data is as follows:
comp date adj_date val
0 a 1999-12-31 NaT 50
1 a 2000-01-31 NaT 51
2 a 2000-02-29 NaT 52
3 a 2000-03-31 NaT 53
4 a 2000-04-30 NaT 54
5 a 2000-05-31 NaT 55
6 a 2000-06-30 NaT 56
----------------------------------
7 a 2000-07-31 2000-01-31 57
8 a 2000-08-31 2000-02-29 58
9 a 2000-09-30 2000-03-31 59
10 a 2000-10-31 2000-04-30 60
11 a 2000-11-30 2000-05-31 61
12 a 2000-12-31 2000-06-30 62
13 a 2001-01-31 2000-07-31 63
14 a 2001-02-28 2000-08-31 64
15 a 2001-03-31 2000-09-30 65
16 a 2001-04-30 2000-10-31 66
17 a 2001-05-31 2000-11-30 67
18 a 2001-06-30 2000-12-31 68
----------------------------------
19 a 2001-07-31 2001-01-31 69
20 a 2001-08-31 2001-02-28 70
21 a 2001-09-30 2001-03-31 71
22 a 2001-10-31 2001-04-30 72
23 a 2001-11-30 2001-05-31 73
24 a 2001-12-31 2001-06-30 74
25 a 2002-01-31 2001-07-31 75
26 a 2002-02-28 2001-08-31 76
27 a 2002-03-31 2001-09-30 77
28 a 2002-04-30 2001-10-31 78
29 a 2002-05-31 2001-11-30 79
30 a 2002-06-30 2001-12-31 80
----------------------------------
31 a 2002-07-31 2002-01-31 81
32 a 2002-08-31 2002-02-28 82
33 a 2002-09-30 2002-03-31 83
34 a 2002-10-31 2002-04-30 84
35 a 2002-11-30 2002-05-31 85
36 a 2002-12-31 2002-06-30 86
37 a 2003-01-31 2002-07-31 87
38 a 2003-02-28 2002-08-31 88
39 a 2003-03-31 2002-09-30 89
40 a 2003-04-30 2002-10-31 90
41 a 2003-05-31 2002-11-30 91
42 a 2003-06-30 2002-12-31 92
----------------------------------
date: the actual date, as an end-of-month date.
adj_date = date + MonthEnd(-6)
val: a given value.
I want to create a new column val_new where:
it references the val of the previous year's December;
val_new applies to date from July of one year through June of the following year, or equivalently, in adj_date terms, from January through December of the adjusted year.
(B) - Desired output:
comp date adj_date val val_new
0 a 1999-12-31 NaT 50 NaN
1 a 2000-01-31 NaT 51 NaN
2 a 2000-02-29 NaT 52 NaN
3 a 2000-03-31 NaT 53 NaN
4 a 2000-04-30 NaT 54 NaN
5 a 2000-05-31 NaT 55 NaN
6 a 2000-06-30 NaT 56 NaN
-------------------------------------------
7 a 2000-07-31 2000-01-31 57 50.0
8 a 2000-08-31 2000-02-29 58 50.0
9 a 2000-09-30 2000-03-31 59 50.0
10 a 2000-10-31 2000-04-30 60 50.0
11 a 2000-11-30 2000-05-31 61 50.0
12 a 2000-12-31 2000-06-30 62 50.0
13 a 2001-01-31 2000-07-31 63 50.0
14 a 2001-02-28 2000-08-31 64 50.0
15 a 2001-03-31 2000-09-30 65 50.0
16 a 2001-04-30 2000-10-31 66 50.0
17 a 2001-05-31 2000-11-30 67 50.0
18 a 2001-06-30 2000-12-31 68 50.0
-------------------------------------------
19 a 2001-07-31 2001-01-31 69 62.0
20 a 2001-08-31 2001-02-28 70 62.0
21 a 2001-09-30 2001-03-31 71 62.0
22 a 2001-10-31 2001-04-30 72 62.0
23 a 2001-11-30 2001-05-31 73 62.0
24 a 2001-12-31 2001-06-30 74 62.0
25 a 2002-01-31 2001-07-31 75 62.0
26 a 2002-02-28 2001-08-31 76 62.0
27 a 2002-03-31 2001-09-30 77 62.0
28 a 2002-04-30 2001-10-31 78 62.0
29 a 2002-05-31 2001-11-30 79 62.0
30 a 2002-06-30 2001-12-31 80 62.0
-------------------------------------------
31 a 2002-07-31 2002-01-31 81 74.0
32 a 2002-08-31 2002-02-28 82 74.0
33 a 2002-09-30 2002-03-31 83 74.0
34 a 2002-10-31 2002-04-30 84 74.0
35 a 2002-11-30 2002-05-31 85 74.0
36 a 2002-12-31 2002-06-30 86 74.0
37 a 2003-01-31 2002-07-31 87 74.0
38 a 2003-02-28 2002-08-31 88 74.0
39 a 2003-03-31 2002-09-30 89 74.0
40 a 2003-04-30 2002-10-31 90 74.0
41 a 2003-05-31 2002-11-30 91 74.0
42 a 2003-06-30 2002-12-31 92 74.0
-------------------------------------------
I have two solutions, but both come at a cost:
Solution 1: create a sub_dec dataframe that takes the val of December of each year, then merge it back into the main data. This works fine, but I don't like this solution because our actual data would involve a lot of merges, and it is not easy or convenient to keep track of all of them.
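A minimal sketch of Solution 1, assuming the column names shown above (keying the December value of year Y to adjusted year Y + 1, as read off the desired output):
sub_dec = data.loc[data['date'].dt.month == 12, ['comp', 'date', 'val']].copy()
sub_dec['adj_year'] = (sub_dec['date'].dt.year + 1).astype(float)  # Dec of year Y feeds adjusted year Y + 1; float so both merge keys share a dtype
sub_dec = sub_dec.rename(columns={'val': 'val_new'})[['comp', 'adj_year', 'val_new']]
data['adj_year'] = data['adj_date'].dt.year  # float, NaN where adj_date is NaT
data = data.merge(sub_dec, on=['comp', 'adj_year'], how='left')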
Solution 2: (1) create a lag with shift(7), (2) set val_new to None for every adj_date month except January, (3) then use groupby with ffill. This solution works nicely, but if any rows are missing, or the dates are not continuous, the entire output is wrong.
Create adj_year:
data['adj_year'] = data['adj_date'].dt.year
Cross-reference by shift(7):
data['val_new'] = data.groupby('comp')['val'].shift(7)
Set val_new to None wherever the adj_date month is not January:
data.loc[data['adj_date'].dt.month != 1, 'val_new'] = None
Use ffill to fill in the Nones within each ['comp', 'adj_year'] group:
data['val_new'] = data.groupby(['comp', 'adj_year'])['val_new'].ffill()
Any suggestion to overcome the drawback of Solution 2, or any other new solution, is appreciated.
Thank you

You can use Timedelta with the correct conversion from seconds to months, according to your needs.
Check these two resources for more info:
https://docs.python.org/3/library/datetime.html
pandas: function equivalent to SQL's datediff()?
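As another option for the shift(7) fragility specifically, here is a minimal merge-free sketch (assuming data holds the question's columns, with date and adj_date as datetimes): key each company's December val by year and look it up from the year before adj_date, so missing or non-continuous rows only affect themselves.
import numpy as np
import pandas as pd
# December `val` per (comp, year); this is the reference for the following July-June window
dec = data.loc[data['date'].dt.month == 12]
dec_map = dict(zip(zip(dec['comp'], dec['date'].dt.year), dec['val']))
# key each row by (comp, year before adj_date); rows with NaT adj_date stay NaN
keys = [(c, d.year - 1) if pd.notna(d) else None
        for c, d in zip(data['comp'], data['adj_date'])]
data['val_new'] = [dec_map.get(k, np.nan) for k in keys]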

Related

Efficient way to do 3 left outer joins in SQL for two tables, with dropping and renaming of some columns

I have the 2 tables below ("Goal_n_Cat_Tab" and "Sales_Tab", each with 50+ other columns).
-- Goal_n_Cat_Tab
id1 ID col1 Goal col2 col3 Date category
85643 G-00001 671 NaN 793 500 2021-06-13 J302022
85644 G-00001 5 56 8 89 2021-06-13 J302022
85644 G-00002 5 78 8 89 2021-06-14 J302022
8564312 G-00002 NaN NaN NaN NaN 2021-06-13 J302022
8564314 G-00002 4 89 43 44 2021-06-14 J302022
85645 G-00001 60 73 610 60 2021-06-15 J302022
856442 G-00001 60 NaN 610 60 2021-06-15 J302022
85646 G-00001 NaN NaN NaN NaN 2021-06-13 J302022
8564318 G-00001 0 NaN 0 5 2021-06-16 J302022
85647 G-00001 6 NaN 16 6 2021-06-13 J302020
85648 G-00002 3 NaN 23 3 2021-06-13 J302021
85649 G-00002 2 34 72 2 2021-06-13 J302021
85655 G-00002 4 NaN 48 4 2021-06-16 J302022
56731 G-00002 32 7 3 13 2021-05-23 J302021
34566 G-00003 3 84 28 12 2021-05-13 J302021
78931 G-00003 1 NaN 5 14 2021-03-26 J302022
78931 G-00003 23 5 3 98 2021-05-13 J302021
--Sales_Tab
RA Oa Goal userid Sa Ai ID col1 col2 brand
0.96 5 771 85640 10 2 G-00001 1087 993 ABC
0.96 16 213 85844 32 38 G-00004 1200 8 cbc
7.25 15 42 35644 14 4 G-00002 173 8 ads
0.96 46 435 5564312 32 81 G-00002 1151 876 efn
0 3 NA 8564314 90 0 G-00002 1158 43 hae
8.7 51 822 856451 10 21 G-00002 1135 610 ABC
8.7 19 129 836442 32 3 G-00003 1169 610 cbc
1.48 45 732 36892 16 41 G-00001 1157 0 ads
0.96 46 542 8564318 7 81 G-00002 1134 0 efn
6.92 30 386 85647 67 14 G-00003 1084 146 hae
1.48 45 35 85648 16 41 G-00004 196 123 ABC
0.65 46 675 82749 7 81 G-00002 1104 172 cbc
8.7 30 772 85655 67 0 G-00002 1172 748 ads
0 56 NA 521731 17 0 G-00002 1104 43 efn
3.09 52.71 728 34566 32.44 33.31 G-00003 1139 278 hae
3.08 55.43 56 78931 32.79 33.87 G-00003 1184 128 ABC
3.07 58.14 679 78931 33.15 34.44 G-00003 1107 329 cbc
Here is what I want to do.
I want to do the 1st left outer join based on "id1" in "Goal_n_Cat_Tab" (left side) and "userid" in "Sales_Tab" (right side). First I have to rename the "Goal" column to "GoalwithBrand" and "col1" to "col1Brand", and drop the "col2" column in "Sales_Tab", then do the left outer join. I need all the other columns in the result. Is there a way to do this in one step, or what would be the efficient way to do this, since I have 50+ columns in my "Sales_Tab" table?
The 2nd left outer join should be based on the "ID" column. Same as above, I want to rename the "Goal" column to "GoalwithBrand" and "col1" to "col1Brand", and drop the "col2" column in "Sales_Tab", then do the left outer join.
The 3rd left outer join should be based on the "ID" column.
Thanks in advance!
Hive is funny when it comes to efficiency. It really depends on the data itself and how it is stored in HDFS across many servers. It also depends on a lot of other things related to the number of resources the process has access to, so it's hard to say for sure.
However, I think you can write this all in one query, with each "output" being a subquery that you can tailor to only have the columns you want to use in subsequent steps.
Once you hit a performance wall, it might be necessary to have each step write into an intermediate table, but I like to avoid that until it's absolutely necessary.
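For comparison, if the two tables were loaded as pandas DataFrames named as in the question (an assumption; the answer above is about Hive), the rename/drop and the first left outer join could be chained in one statement. A hedged sketch only; the suffixes value is an illustrative choice for the remaining overlapping column "ID":
first_join = Goal_n_Cat_Tab.merge(
    Sales_Tab.rename(columns={'Goal': 'GoalwithBrand', 'col1': 'col1Brand'})
             .drop(columns=['col2']),
    left_on='id1', right_on='userid',
    how='left',
    suffixes=('', '_sales'),   # keeps all 50+ columns from both sides
)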

How do I delete rows that contain just a specific word in pandas [duplicate]

I have the following DataFrame:
daysago line_race rating rw wrating
line_date
2007-03-31 62 11 56 1.000000 56.000000
2007-03-10 83 11 67 1.000000 67.000000
2007-02-10 111 9 66 1.000000 66.000000
2007-01-13 139 10 83 0.880678 73.096278
2006-12-23 160 10 88 0.793033 69.786942
2006-11-09 204 9 52 0.636655 33.106077
2006-10-22 222 8 66 0.581946 38.408408
2006-09-29 245 9 70 0.518825 36.317752
2006-09-16 258 11 68 0.486226 33.063381
2006-08-30 275 8 72 0.446667 32.160051
2006-02-11 475 5 65 0.164591 10.698423
2006-01-13 504 0 70 0.142409 9.968634
2006-01-02 515 0 64 0.134800 8.627219
2005-12-06 542 0 70 0.117803 8.246238
2005-11-29 549 0 70 0.113758 7.963072
2005-11-22 556 0 -1 0.109852 -0.109852
2005-11-01 577 0 -1 0.098919 -0.098919
2005-10-20 589 0 -1 0.093168 -0.093168
2005-09-27 612 0 -1 0.083063 -0.083063
2005-09-07 632 0 -1 0.075171 -0.075171
2005-06-12 719 0 69 0.048690 3.359623
2005-05-29 733 0 -1 0.045404 -0.045404
2005-05-02 760 0 -1 0.039679 -0.039679
2005-04-02 790 0 -1 0.034160 -0.034160
2005-03-13 810 0 -1 0.030915 -0.030915
2004-11-09 934 0 -1 0.016647 -0.016647
I need to remove the rows where line_race is equal to 0. What's the most efficient way to do this?
If I'm understanding correctly, it should be as simple as:
df = df[df.line_race != 0]
But for any future passers-by, it is worth mentioning that df = df[df.line_race != 0] doesn't do anything when trying to filter for None/missing values.
Does work:
df = df[df.line_race != 0]
Doesn't do anything:
df = df[df.line_race != None]
Does work:
df = df[df.line_race.notnull()]
Just to add another solution, particularly useful if you are using the new pandas accessors; other solutions will replace the original DataFrame and lose the accessors:
df.drop(df.loc[df['line_race']==0].index, inplace=True)
If you want to delete rows based on multiple values of the column, you could use:
df[(df.line_race != 0) & (df.line_race != 10)]
To drop all rows with values 0 and 10 for line_race.
In the case of multiple values and a str dtype,
I used the following to filter out given values in a column:
def filter_rows_by_values(df, col, values):
    return df[~df[col].isin(values)]
Example:
In a DataFrame I want to remove rows which have values "b" and "c" in column "str"
df = pd.DataFrame({"str": ["a","a","a","a","b","b","c"], "other": [1,2,3,4,5,6,7]})
df
str other
0 a 1
1 a 2
2 a 3
3 a 4
4 b 5
5 b 6
6 c 7
filter_rows_by_values(df, "str", ["b","c"])
str other
0 a 1
1 a 2
2 a 3
3 a 4
Though the previous answers are similar to what I am going to do, using the index method does not require another indexing method such as .loc(). It can be done in a similar but more concise manner:
df.drop(df.index[df['line_race'] == 0], inplace = True)
The best way to do this is with boolean masking:
In [56]: df
Out[56]:
line_date daysago line_race rating raw wrating
0 2007-03-31 62 11 56 1.000 56.000
1 2007-03-10 83 11 67 1.000 67.000
2 2007-02-10 111 9 66 1.000 66.000
3 2007-01-13 139 10 83 0.881 73.096
4 2006-12-23 160 10 88 0.793 69.787
5 2006-11-09 204 9 52 0.637 33.106
6 2006-10-22 222 8 66 0.582 38.408
7 2006-09-29 245 9 70 0.519 36.318
8 2006-09-16 258 11 68 0.486 33.063
9 2006-08-30 275 8 72 0.447 32.160
10 2006-02-11 475 5 65 0.165 10.698
11 2006-01-13 504 0 70 0.142 9.969
12 2006-01-02 515 0 64 0.135 8.627
13 2005-12-06 542 0 70 0.118 8.246
14 2005-11-29 549 0 70 0.114 7.963
15 2005-11-22 556 0 -1 0.110 -0.110
16 2005-11-01 577 0 -1 0.099 -0.099
17 2005-10-20 589 0 -1 0.093 -0.093
18 2005-09-27 612 0 -1 0.083 -0.083
19 2005-09-07 632 0 -1 0.075 -0.075
20 2005-06-12 719 0 69 0.049 3.360
21 2005-05-29 733 0 -1 0.045 -0.045
22 2005-05-02 760 0 -1 0.040 -0.040
23 2005-04-02 790 0 -1 0.034 -0.034
24 2005-03-13 810 0 -1 0.031 -0.031
25 2004-11-09 934 0 -1 0.017 -0.017
In [57]: df[df.line_race != 0]
Out[57]:
line_date daysago line_race rating raw wrating
0 2007-03-31 62 11 56 1.000 56.000
1 2007-03-10 83 11 67 1.000 67.000
2 2007-02-10 111 9 66 1.000 66.000
3 2007-01-13 139 10 83 0.881 73.096
4 2006-12-23 160 10 88 0.793 69.787
5 2006-11-09 204 9 52 0.637 33.106
6 2006-10-22 222 8 66 0.582 38.408
7 2006-09-29 245 9 70 0.519 36.318
8 2006-09-16 258 11 68 0.486 33.063
9 2006-08-30 275 8 72 0.447 32.160
10 2006-02-11 475 5 65 0.165 10.698
UPDATE: Now that pandas 0.13 is out, another way to do this is df.query('line_race != 0').
The given answer is correct. Nonetheless, as someone above said, you can use df.query('line_race != 0'), which depending on your problem can be much faster. Highly recommended.
Another way of doing it. It may not be the most efficient way, as the code looks a bit more complex than the code mentioned in other answers, but it is still an alternative way of doing the same thing.
df = df.drop(df[df['line_race']==0].index)
One efficient and idiomatic pandas way is to use the eq() method:
df[~df.line_race.eq(0)]
I compiled and ran my code; it is accurate. You can try it yourself.
data = pd.read_excel('file.xlsx')
If you have any special character or space in the column name, you can write it in quotes, as in the given code:
data = data[data['expire/t'].notnull()]
print(data)
If there is just a single-word column name without any space or special character, you can access it directly:
data = data[data.expire != 0]
print(data)
Adding one more way to do this.
df = df.query("line_race!=0")
There are various ways to achieve that. Below are various options one can use, depending on the specificities of one's use case.
One will consider that the OP's dataframe is stored in the variable df.
Option 1
For the OP's case, considering that the only column with 0 values is line_race, the following will do the work
df_new = df[df != 0].dropna()
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
However, as that is not always the case, I would recommend checking the following options, where one specifies the column name.
Option 2
tshauck's approach ends up being better than Option 1, because one is able to specify the column. There are, however, additional variations depending on how one wants to refer to the column:
For example, using the position in the dataframe
df_new = df[df[df.columns[2]] != 0]
Or by explicitly indicating the column as follows
df_new = df[df['line_race'] != 0]
One can also follow the same logic but using a custom lambda function, such as
df_new = df[df.apply(lambda x: x['line_race'] != 0, axis=1)]
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
Option 3
Using pandas.Series.map and a custom lambda function
df_new = df[df['line_race'].map(lambda x: x != 0)]
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
Option 4
Using pandas.DataFrame.drop as follows
df_new = df.drop(df[df['line_race'] == 0].index)
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
Option 5
Using pandas.DataFrame.query as follows
df_new = df.query('line_race != 0')
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
Option 6
Using pandas.DataFrame.drop and pandas.DataFrame.query as follows
df_new = df.drop(df.query('line_race == 0').index)
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
Option 7
If one doesn't have strong opinions on the output, one can use a vectorized approach with numpy.select
df_new = np.select([df != 0], [df], default=np.nan)
[Out]:
[['2007-03-31' 62 11.0 56 1.0 56.0]
['2007-03-10' 83 11.0 67 1.0 67.0]
['2007-02-10' 111 9.0 66 1.0 66.0]
['2007-01-13' 139 10.0 83 0.880678 73.096278]
['2006-12-23' 160 10.0 88 0.793033 69.786942]
['2006-11-09' 204 9.0 52 0.636655 33.106077]
['2006-10-22' 222 8.0 66 0.581946 38.408408]
['2006-09-29' 245 9.0 70 0.518825 36.317752]
['2006-09-16' 258 11.0 68 0.486226 33.063381]
['2006-08-30' 275 8.0 72 0.446667 32.160051]
['2006-02-11' 475 5.0 65 0.164591 10.698423]]
This can also be converted to a dataframe with
df_new = pd.DataFrame(df_new, columns=df.columns)
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.0 56.0
1 2007-03-10 83 11.0 67 1.0 67.0
2 2007-02-10 111 9.0 66 1.0 66.0
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
With regards to the most efficient solution, that would depend on how one wants to measure efficiency. Assuming that one wants to measure the time of execution, one way that one can go about doing it is with time.perf_counter().
If one measures the time of execution for all the options above, one gets the following
method time
0 Option 1 0.00000110000837594271
1 Option 2.1 0.00000139995245262980
2 Option 2.2 0.00000369996996596456
3 Option 2.3 0.00000160001218318939
4 Option 3 0.00000110000837594271
5 Option 4 0.00000120000913739204
6 Option 5 0.00000140001066029072
7 Option 6 0.00000159995397552848
8 Option 7 0.00000150001142174006
However, this might change depending on the dataframe one uses, on the requirements (such as hardware), and more.
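As a minimal sketch of how one such measurement can be taken with time.perf_counter() (wrapping any one of the options above; the numbers will differ between runs and machines):
import time
start = time.perf_counter()
df_new = df[df['line_race'] != 0]   # the option under test
elapsed = time.perf_counter() - start
print(f"elapsed: {elapsed:.20f} s")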
Notes:
There are various suggestions on using inplace=True. Would suggest reading this: https://stackoverflow.com/a/59242208/7109869
There are also some people with strong opinions on .apply(). Would suggest reading this: When should I (not) want to use pandas apply() in my code?
If one has missing values, one might want to consider as well pandas.DataFrame.dropna. Using the option 2, it would be something like
df = df[df['line_race'] != 0].dropna()
There are additional ways to measure the time of execution, so I would recommend this thread: How do I get time of a Python program's execution?
Just adding another way, for a DataFrame, expanded over all columns:
for column in df.columns:
    df = df[df[column] != 0]
Example:
import numpy as np

def z_score(data, count):
    threshold = 3
    for column in data.columns:
        mean = np.mean(data[column])
        std = np.std(data[column])
        for i in data[column]:
            zscore = (i - mean) / std
            if np.abs(zscore) > threshold:
                count = count + 1
                data = data[data[column] != i]
    return data, count
Just in case you need to delete rows where the value can be in different columns: in my case I was using percentages, so I wanted to delete the rows which have a value of 1 in any column, since that means it's 100%.
for x in df:
    df.drop(df.loc[df[x] == 1].index, inplace=True)
This is not optimal if your df has too many columns.
So many options have been provided (or maybe I didn't pay enough attention, sorry if that's the case), but no one mentioned this:
we can use the ~ notation in pandas (it gives us the inverse of the condition):
df = df[~(df["line_race"] == 0)]
It doesn't make much difference for a simple example like this, but for complicated logic, I prefer to use drop() when deleting rows because it is more straightforward than using inverse logic. For example, delete rows where A=1 AND (B=2 OR C=3).
Here's a scalable syntax that is easy to understand and can handle complicated logic:
df.drop(df.query("`line_race` == 0").index)
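For instance, the compound condition mentioned above could be expressed as follows (a sketch assuming df has columns named A, B and C):
df = df.drop(df.query("A == 1 and (B == 2 or C == 3)").index)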
You can try using this:
df.drop(df[df.line_race == 0].index, inplace=True)

pandas how to filter and slice with multiple conditions

Using pandas, how do I return a dataframe filtered by a value of 2 in the 'GEN' column and a value of 20 in the 'AGE' column, excluding the columns named 'GEN' and 'BP'? Thanks in advance :)
AGE GEN BMI BP S1 S2 S3 S4 S5 S6 Y
59 2 32.1 101 157 93.2 38 4 4.8598 87 151
48 1 21.6 87 183 103.2 70 3 3.8918 69 75
72 2 30.5 93 156 93.6 41 4 4.6728 85 141
24 1 25.3 84 198 131.4 40 5 4.8903 89 206
50 1 23 101 192 125.4 52 4 4.2905 80 135
23 1 22.6 89 139 64.8 61 2 4.1897 68 97
20 2 22 90 160 99.6 50 3 3.9512 82 138
66 2 26.2 114 255 185 56 4.5 4.2485 92 63
60 2 32.1 83 179 119.4 42 4 4.4773 94 110
20 1 30 85 180 93.4 43 4 5.3845 88 310
You can do this -
cols = df.columns[~df.columns.isin(['GEN','BP'])]
out = df.loc[(df['GEN'] == 2) & (df['AGE'] == 20), cols]
OR
out = df.query("GEN == 2 and AGE == 20").loc[:, cols]
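An equivalent sketch that filters the rows first and then drops the two columns by name (assuming the frame is named df, as above):
out = df.loc[(df['GEN'] == 2) & (df['AGE'] == 20)].drop(columns=['GEN', 'BP'])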

How can I translate this nested query into R dplyr?

I'm a newbie in R and I'm trying to translate the following nested query using dplyr:
SELECT * FROM DAT
where concat(code, datcomp) IN
(SELECT concat(code, max(datcomp)) from DAT group by code)
DAT is a data frame containing several hundred columns.
code is a non-unique numeric field
datcomp is a string like 'YYYY-MM-DDTHH24:MI:SS'
What I'm trying to do is extract, from the data frame, the most recent timestamp for each code.
Eg: given
code datcomp
1 1005 2019-06-12T09:13:47
2 1005 2019-06-19T16:15:46
3 1005 2019-06-17T21:46:02
4 1005 2019-06-17T17:52:01
5 1005 2019-06-24T13:10:05
6 1015 2019-05-02T10:33:13
7 1030 2019-06-11T14:58:16
8 1030 2019-06-20T09:50:20
9 2008 2019-05-17T18:43:34
10 2008 2019-05-28T15:16:50
11 3030 2019-05-24T09:51:30
12 3032 2019-05-30T16:40:03
13 3032 2019-05-21T09:34:27
14 3062 2019-05-17T16:10:53
15 3062 2019-06-20T16:45:51
16 3069 2019-07-01T17:54:59
17 3069 2019-07-09T12:39:56
18 3069 2019-07-09T17:45:09
19 3069 2019-07-17T14:31:01
20 3069 2019-06-24T13:42:27
21 3104 2019-06-05T14:47:38
22 3104 2019-05-17T15:18:47
23 3111 2019-06-06T15:52:51
24 3111 2019-07-01T09:50:33
25 3127 2019-04-16T16:04:59
26 3127 2019-05-15T11:49:29
27 3249 2019-06-21T18:24:14
28 3296 2019-07-01T17:44:54
29 3311 2019-06-10T11:05:20
30 3311 2019-06-21T12:11:05
31 3311 2019-06-19T11:36:47
32 3332 2019-05-13T09:38:21
33 3440 2019-06-11T12:53:07
34 3440 2019-05-17T17:40:19
35 3493 2019-04-18T11:18:37
36 5034 2019-06-06T15:24:04
37 5034 2019-05-31T11:39:17
38 5216 2019-05-20T17:16:07
39 5216 2019-05-14T15:08:15
40 5385 2019-05-17T13:19:54
41 5387 2019-05-13T09:33:31
42 5387 2019-05-07T10:49:14
43 5387 2019-05-15T10:38:25
44 5696 2019-06-10T16:16:49
45 5696 2019-06-11T14:47:00
46 5696 2019-06-13T17:10:36
47 6085 2019-05-21T10:15:58
48 6085 2019-06-03T11:22:34
49 6085 2019-05-29T11:25:37
50 6085 2019-05-31T12:52:42
51 6175 2019-05-13T17:17:48
52 6175 2019-05-27T09:58:04
53 6175 2019-05-23T10:32:21
54 6230 2019-06-21T14:28:11
55 6230 2019-06-11T16:00:48
56 6270 2019-05-28T08:57:38
57 6270 2019-05-17T16:17:04
58 10631 2019-05-22T09:46:51
59 10631 2019-07-03T10:41:41
60 10631 2019-06-06T11:52:42
What I need is
code datcomp
1 1005 2019-06-24T13:10:05
2 1015 2019-05-02T10:33:13
3 1030 2019-06-20T09:50:20
4 2008 2019-05-28T15:16:50
5 3030 2019-05-24T09:51:30
6 3032 2019-05-30T16:40:03
7 3062 2019-06-20T16:45:51
8 3069 2019-07-17T14:31:01
9 3104 2019-06-05T14:47:38
10 3111 2019-07-01T09:50:33
11 3127 2019-05-15T11:49:29
12 3249 2019-06-21T18:24:14
13 3296 2019-07-01T17:44:54
14 3311 2019-06-21T12:11:05
15 3332 2019-05-13T09:38:21
16 3440 2019-06-11T12:53:07
17 3493 2019-04-18T11:18:37
18 5034 2019-06-06T15:24:04
19 5216 2019-05-20T17:16:07
20 5385 2019-05-17T13:19:54
21 5387 2019-05-15T10:38:25
22 5696 2019-06-13T17:10:36
23 6085 2019-06-03T11:22:34
24 6175 2019-05-27T09:58:04
25 6230 2019-06-21T14:28:11
26 6270 2019-05-28T08:57:38
27 10631 2019-07-03T10:41:41
Thank you in advance.
A more generalized version: group, then sort so that you get whatever you want first, then slice (which would allow you to take the nth value from each group as sorted):
dati %>%
  group_by(code) %>%
  arrange(desc(datcomp)) %>%
  slice(1) %>%
  ungroup()

Convert all colors in pdf into one specific color

I'm working on a php project where I need to perform some pdf manipulation.
I need to "convert" all colors of a vector file (PDF) into one very specific color (a spot color in my case).
Here is an illustrated example
The input file can vary, and it can contain any color (so I can't just convert all "red" or "green" to my target color).
I have a fair idea on how to do it on a raster image using imagemagick's composite, but I'm unsure if it's even possible with a vector image.
My first approach was to create a template pdf, with a filled rectangle in the desired color. My hope was then to use ghostscript to somehow apply the input file as a mask on said template. But I assume this wouldn't be possible as vector files are different from raster images.
My second approach was to use ghostscript to convert all colors (regardless of colorspace) into the desired color. But after extensive googling, I've only found solutions that convert from one colorspace to another (i.e. sRGB to CMYK, CMYK to gray-scale, etc.)
I'm not much of a designer, so perhaps I am simply lacking the proper "terms" for these "actions".
TL;DR
I am looking for a library/tool that can help me "convert" all colors of a vector file (PDF) into one very specific color.
The input file may vary (various shapes and colors), but will always be a pdf file without any fonts.
Output must remain as a vector file (read, no rasterisation.)
I have root access on a VPS running Linux (CentOS 7; I assume that is irrelevant).
You could try rasterising at a high resolution and converting the colours with ImageMagick, then re-vectorising with potrace
So, if you had a PDF, you would do:
convert -density 288 document.pdf ...
As you have provided a PNG, I will do:
convert image.png -fill black -fuzz 10% +opaque white pgm:- | potrace -b svg -o result.svg -
which gives this SVG:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 20010904//EN"
"http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd">
<svg version="1.0" xmlns="http://www.w3.org/2000/svg"
width="800.000000pt" height="450.000000pt" viewBox="0 0 800.000000 450.000000"
preserveAspectRatio="xMidYMid meet">
<metadata>
Created by potrace 1.13, written by Peter Selinger 2001-2015
</metadata>
<g transform="translate(0.000000,450.000000) scale(0.100000,-0.100000)"
fill="#000000" stroke="none">
<path d="M4800 4324 c0 -50 -2 -55 -17 -49 -84 35 -140 -17 -130 -119 7 -77
70 -120 122 -82 16 11 21 11 33 0 7 -8 18 -12 23 -9 5 4 9 76 9 161 0 147 -1
154 -20 154 -18 0 -20 -7 -20 -56z m-22 -90 c46 -32 18 -134 -38 -134 -25 0
-40 29 -40 79 0 39 19 71 43 71 7 0 23 -7 35 -16z"/>
<path d="M4926 4358 c-9 -12 -16 -35 -16 -50 0 -18 -5 -28 -15 -28 -8 0 -15
-7 -15 -15 0 -8 7 -15 15 -15 12 0 15 -17 15 -89 0 -89 6 -105 38 -94 8 3 12
31 12 94 0 88 0 89 25 89 16 0 25 6 25 15 0 9 -9 15 -25 15 -21 0 -25 5 -25
30 0 30 7 34 43 30 13 -1 18 4 15 17 -5 29 -72 30 -92 1z"/>
<path d="M3347 4364 c-4 -4 -7 -16 -7 -26 0 -14 6 -19 23 -16 14 2 22 10 22
23 0 20 -25 32 -38 19z"/>
<path d="M4170 4310 c0 -23 -4 -30 -20 -30 -11 0 -20 -7 -20 -15 0 -8 9 -15
20 -15 18 0 20 -7 20 -80 0 -74 2 -81 25 -96 32 -21 75 -12 75 17 0 16 -4 19
-21 14 -30 -10 -39 9 -39 83 l0 62 30 0 c20 0 30 5 30 15 0 10 -10 15 -30 15
-27 0 -30 3 -30 30 0 23 -4 30 -20 30 -16 0 -20 -7 -20 -30z"/>
<path d="M3345 4278 c-3 -8 -4 -59 -3 -114 2 -80 6 -99 18 -99 12 0 15 19 15
109 0 79 -4 111 -12 113 -7 3 -15 -2 -18 -9z"/>
<path d="M3453 4283 c-9 -3 -13 -34 -13 -108 0 -74 4 -105 13 -108 29 -10 37
6 37 78 0 57 4 75 18 88 46 42 72 10 72 -91 0 -54 4 -71 15 -76 22 -8 26 10
23 104 -3 77 -5 84 -31 104 -24 17 -32 19 -59 8 -18 -6 -38 -8 -47 -3 -9 5
-22 6 -28 4z"/>
<path d="M3687 4283 c-4 -3 -7 -71 -7 -150 l0 -143 25 0 c23 0 25 4 25 45 0
42 2 45 19 35 33 -17 61 -11 92 19 24 25 29 37 29 81 0 95 -51 141 -119 107
-25 -13 -31 -13 -35 -1 -6 15 -19 18 -29 7z m122 -47 c19 -22 23 -78 9 -106
-29 -55 -88 -26 -88 43 0 62 48 100 79 63z"/>
<path d="M3927 4284 c-4 -4 -7 -45 -7 -91 0 -76 2 -86 25 -108 27 -28 61 -32
92 -10 18 13 22 13 27 0 3 -8 12 -12 21 -9 13 5 15 24 13 113 -3 98 -4 106
-23 106 -18 0 -20 -8 -23 -75 -4 -94 -28 -128 -72 -100 -10 6 -16 34 -20 91
-5 75 -15 101 -33 83z"/>
<path d="M4432 4282 c-9 -7 -12 -43 -10 -148 3 -136 4 -139 26 -142 20 -3 22
1 22 41 l0 45 35 -11 c31 -9 39 -8 63 10 37 27 54 83 42 136 -15 68 -64 94
-120 63 -20 -12 -26 -12 -35 0 -6 8 -15 10 -23 6z m122 -54 c22 -31 20 -81 -3
-109 -19 -23 -21 -23 -48 -9 -24 13 -28 23 -31 62 -3 39 1 49 20 62 30 22 44
20 62 -6z"/>
<path d="M4310 4096 c0 -30 30 -43 47 -21 16 23 5 45 -23 45 -19 0 -24 -5 -24
-24z"/>
<path d="M4046 3795 l-67 -141 -227 -12 c-418 -22 -765 -74 -1127 -167 -612
-157 -1080 -387 -1387 -684 -214 -205 -323 -393 -359 -615 -16 -101 -6 -270
20 -361 136 -461 637 -856 1409 -1111 152 -51 434 -125 583 -154 l66 -13 -30
-169 c-16 -93 -27 -171 -24 -174 2 -3 124 58 271 135 l266 140 80 -9 c44 -5
197 -14 339 -21 259 -12 617 -3 844 21 l88 9 265 -140 c146 -77 268 -138 270
-136 5 4 -41 294 -52 328 -4 13 8 19 58 28 465 89 939 260 1278 461 626 370
880 871 686 1356 -69 174 -228 375 -415 526 -517 418 -1411 697 -2402 750
l-226 12 -71 141 -70 140 -66 -140z m-202 -407 c-31 -62 -119 -241 -196 -398
-76 -156 -140 -285 -142 -287 -3 -3 -799 -120 -1156 -170 -102 -14 -188 -29
-193 -32 -4 -4 102 -113 235 -242 133 -129 353 -344 489 -479 l248 -245 -45
-260 c-25 -143 -58 -332 -73 -420 l-27 -160 -41 2 c-61 2 -333 68 -515 124
-674 209 -1153 533 -1334 905 -59 121 -77 209 -71 349 5 137 35 235 109 359
58 97 206 261 311 344 463 366 1242 627 2097 701 69 6 141 13 160 15 19 1 72
4 118 4 l82 2 -56 -112z m906 86 c760 -79 1420 -283 1875 -581 864 -566 763
-1326 -245 -1840 -266 -136 -602 -253 -942 -328 -92 -21 -173 -35 -181 -32 -9
3 -20 44 -31 114 -10 59 -42 248 -72 419 l-54 311 213 210 c116 115 337 331
489 479 153 148 274 271 270 275 -4 3 -106 20 -227 37 -452 64 -1118 162
-1120 164 -6 6 -195 387 -291 587 l-104 214 137 -7 c76 -4 203 -14 283 -22z
m-424 -2761 c137 -73 200 -111 193 -118 -14 -14 -794 -14 -809 1 -7 7 49 41
192 117 112 58 207 107 212 107 5 0 100 -48 212 -107z"/>
<path d="M1815 3669 c-46 -47 -113 -80 -221 -111 -62 -17 -106 -22 -204 -22
-137 0 -185 12 -221 58 -48 61 -211 80 -449 53 -118 -14 -400 -63 -408 -72 -3
-3 28 -145 32 -145 1 0 55 11 120 25 181 37 365 58 481 53 98 -3 105 -5 125
-30 113 -144 579 -119 806 44 50 35 109 108 97 118 -5 4 -33 21 -63 38 l-55
31 -40 -40z"/>
<path d="M7647 575 c-66 -79 -247 -137 -432 -138 -134 0 -170 10 -221 61 -18
17 -53 37 -84 46 -70 21 -238 21 -395 0 -122 -15 -364 -60 -372 -68 -5 -5 17
-119 26 -133 4 -7 47 -2 121 13 181 37 358 56 477 52 l108 -3 37 -37 c120
-117 482 -110 720 13 75 40 168 123 168 151 0 10 -110 80 -122 77 -2 0 -16
-16 -31 -34z"/>
</g>
</svg>
which looks like this as a PNG (because StackOverflow doesn't allow SVG images AFAIK):
You can make all the PATHs your preferred shade of green by editing the SVG, like this:
sed 's/path /path fill="#7CBE89" /' black.svg > green.svg
You could do this with Ghostscript, but you would need some PostScript programming experience.
Essentially you want to override all the setcolor/setcolorspace operations by looking at each setcolor operation, checking the colour space and values to see if it's your target colour and, if it is, setting the colour space and values to your desired target.
The various PDF operations to set colour space and values are all defined in ghostpdl/Resource/Init/pdf_draw.ps. You'll need to modify the definitions of:
/G and /g (stroke and fill colours in DeviceGray)
/RG and /rg (stroke and fill colours in DeviceRGB)
/K and /k (stroke and fill colours in DeviceCMYK)
/SC and /sc (stroke and fill colours in Indexed, CalGray, CalRGB or Lab)
/SCN and /scn (stroke and fill colours in Pattern, Separation, DeviceN or ICCBased)
There are quite a few wrinkles in there:
You can probably ignore Pattern spaces and just deal with any colours that are set by the pattern itself.
For SC/sc and /SCN/scn you need to figure out whether the colour specified is the target colour, assuming your target can be specified in these spaces. Note that /Indexed is particularly interesting as it can have a base space of any of the other spaces, so you need to look and see.
Finally note that images (bitmaps) are specified differently, and altering those would be much harder.
Depending on the exact nature of the requirement (ie what space/colours constitute valid targets) this could be quite a lengthy task, and it will require someone with PostScript programming ability to write it.
Oh, and on a final note, have you considered transparency? That can specify the blending colour space too, which might mean that after you had substituted the colour, it would be blended in a different colour space, resulting in your careful substitution disappearing.
Lest you think this unlikely I should mention that a number of PDF producers create files with transparency groups in them, even when no actual transparency operations take place.