I want to create a new column in the clin dataframe based on the following conditions:
1 if days_to_death >= 2*365 or is NaN
otherwise 0
The new column should be named SURV.
import numpy as np
vals = clin['days_to_death'].astype(np.float32)
# non-LTS is 0, LTS is 1
surv = [1 if ( v>=2*365 or np.isnan(v) ) else 0 for v in vals ]
clin['SURV'] = clin.apply(surv, axis=1)
Traceback:
SpecificationError: Function names must be unique if there is no new column names assigned
---------------------------------------------------------------------------
SpecificationError Traceback (most recent call last)
<ipython-input-31-603dee8413ce> in <module>
5 # non-LTS is 0, LTS is 1
6 surv = [1 if ( v>=2*365 or np.isnan(v) ) else 0 for v in vals ]
----> 7 clin['SURV'] = clin.apply(surv, axis=1)
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7766 kwds=kwds,
7767 )
-> 7768 return op.get_result()
7769
7770 def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/apply.py in get_result(self)
146 # multiple values for keyword argument "axis"
147 return self.obj.aggregate( # type: ignore[misc]
--> 148 self.f, axis=self.axis, *self.args, **self.kwds
149 )
150
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/frame.py in aggregate(self, func, axis, *args, **kwargs)
7572 axis = self._get_axis_number(axis)
7573
-> 7574 relabeling, func, columns, order = reconstruct_func(func, **kwargs)
7575
7576 result = None
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/aggregation.py in reconstruct_func(func, **kwargs)
93 # there is no reassigned name
94 raise SpecificationError(
---> 95 "Function names must be unique if there is no new column names "
96 "assigned"
97 )
SpecificationError: Function names must be unique if there is no new column names assigned
The clin dataframe:
import pandas as pd
# np.nan is used for the missing value (np is imported above)
clin = pd.DataFrame([[1, '466', '47', 0, '90'],
                     [1, '357', '54', 1, '80'],
                     [1, '108', '72', 1, '60'],
                     [1, '254', '51', 0, '80'],
                     [1, '138', '78', 1, '80'],
                     [0, np.nan, '67', 0, '60']],
                    columns=['vital_status', 'days_to_death', 'age_at_initial_pathologic_diagnosis',
                             'gender', 'karnofsky_performance_score'],
                    index=['TCGA-06-1806', 'TCGA-06-5408', 'TCGA-06-5410', 'TCGA-06-5411',
                           'TCGA-06-5412', 'TCGA-06-5413'])
Expected output:
              vital_status  days_to_death  age_at_initial_pathologic_diagnosis  gender  karnofsky_performance_score  SURV
TCGA-06-1806  1             466            47                                   0       90                           0
TCGA-06-5408  1             357            54                                   1       80                           0
TCGA-06-5410  1             108            72                                   1       60                           0
TCGA-06-5411  1             254            51                                   0       80                           0
TCGA-06-5412  1             138            78                                   1       80                           0
TCGA-06-5413  0             nan            67                                   0       60                           1
Make a new column of all 0's and then update the column with your desired parameters.
clin['SURV'] = 0
clin.loc[pd.to_numeric(clin.days_to_death).ge(2*365) | clin.days_to_death.isna(), 'SURV'] = 1
print(clin)
Output:
vital_status days_to_death age_at_initial_pathologic_diagnosis gender karnofsky_performance_score SURV
TCGA-06-1806 1 466 47 0 90 0
TCGA-06-5408 1 357 54 1 80 0
TCGA-06-5410 1 108 72 1 60 0
TCGA-06-5411 1 254 51 0 80 0
TCGA-06-5412 1 138 78 1 80 0
TCGA-06-5413 0 NaN 67 0 60 1
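As a side note on the error in the question: `surv` there is already a plain list of 0/1 values, while `DataFrame.apply` expects a function, so assigning the list directly (`clin['SURV'] = surv`) would also have worked. A vectorised variant of the same rule could look like the sketch below. It assumes `days_to_death` holds numeric strings or NaN as in the sample; the `TCGA-XX-0000` row with 800 days is hypothetical and is only there to exercise the ">= 2 years" branch.

```python
import numpy as np
import pandas as pd

# Small stand-in for clin; the 'TCGA-XX-0000' row is hypothetical (not in the question).
clin = pd.DataFrame(
    {"days_to_death": ["466", "357", "108", np.nan, "800"]},
    index=["TCGA-06-1806", "TCGA-06-5408", "TCGA-06-5410",
           "TCGA-06-5413", "TCGA-XX-0000"],
)

# The column holds strings, so convert to numbers first (NaN stays NaN).
days = pd.to_numeric(clin["days_to_death"], errors="coerce")

# 1 when days >= 2 years or the value is missing, otherwise 0.
clin["SURV"] = np.where(days.ge(2 * 365) | days.isna(), 1, 0)
print(clin)
```

Converting once into `days` keeps both conditions on the same numeric Series, so the comparison never runs against strings.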
I am trying to upload 2 CSV files from 2 different URLs to a DynamoDB table. I am using pandas to get the desired data from the URLs and merge the 2 dataframes into df3. I'm running into an issue when I use put_item to update the database. I have tried converting the pandas Series into strings, but that doesn't seem to work either.
Here is the lambda function:
import csv
import pandas as pd
import io
import requests
import numpy as np
import boto3
from datetime import datetime
import json
from decimal import Decimal
def lambda_handler(event, context):
    url1 = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv"
    url2 = "https://raw.githubusercontent.com/datasets/covid-19/master/data/time-series-19-covid-combined.csv"
    df1 = pd.read_csv(url1)
    df1 = pd.DataFrame(df1)
    df1 = df1.drop(0)
    df2 = pd.read_csv(url2, delimiter=',')
    df2 = pd.DataFrame(df2)
    df2['Recovered'] = df2['Recovered'].fillna(0).astype(np.int64)
    df2 = df2.loc[df2['Country/Region'] == 'US', 'Recovered']
    df2 = df2.reset_index(drop=True)
    df2.index = np.arange(1, len(df2) + 1)
    df3 = df1.join(df2)
    region = 'eu-west-2'
    try:
        dyndb = boto3.client('dynamodb', region_name=region)
        firstrecord = True
        for row in df1:
            if firstrecord:
                firstrecord = False
                continue
            cases = df3['cases']
            date = df3['date']
            deaths = df3['deaths']
            Recovered = df3['Recovered']
            response = dyndb.put_item(TableName='covidstatstable',
                Item={
                    'cases': {'N': cases},
                    'date': {'S': date},
                    'deaths': {'N': deaths},
                    'Recovered': {'N': Recovered},
                })
            print('Put succeeded:')
    except Exception as e:
        print(str(e))
And here are the function logs:
Test Event Name
test1
Response
null
Function Logs
START RequestId: d5192687-d8ef-41dc-bd2f-efbcb1c7ff6f Version: $LATEST
Parameter validation failed:
Invalid type for parameter Item.cases.N, value: 1 1
2 1
3 2
4 3
5 5
...
607 42066372
608 42274530
609 42404490
610 42551956
611 42678374
Name: cases, Length: 611, dtype: int64, type: <class 'pandas.core.series.Series'>, valid types: <class 'str'>
Invalid type for parameter Item.date.S, value: 1 2020-01-22
2 2020-01-23
3 2020-01-24
4 2020-01-25
5 2020-01-26
...
607 2021-09-19
608 2021-09-20
609 2021-09-21
610 2021-09-22
611 2021-09-23
Name: date, Length: 611, dtype: object, type: <class 'pandas.core.series.Series'>, valid types: <class 'str'>
Invalid type for parameter Item.deaths.N, value: 1 0
2 0
3 0
4 0
5 0
...
607 673939
608 676191
609 678556
610 681343
611 684488
Name: deaths, Length: 611, dtype: int64, type: <class 'pandas.core.series.Series'>, valid types: <class 'str'>
Invalid type for parameter Item.Recovered.N, value: 1 0
2 0
3 0
4 0
5 0
..
607 0
608 0
609 0
610 0
611 0
Name: Recovered, Length: 611, dtype: int64, type: <class 'pandas.core.series.Series'>, valid types: <class 'str'>
END RequestId: d5192687-d8ef-41dc-bd2f-efbcb1c7ff6f
REPORT RequestId: d5192687-d8ef-41dc-bd2f-efbcb1c7ff6f Duration: 902.87 ms Billed Duration: 903 ms Memory Size: 512 MB Max Memory Used: 172 MB Init Duration: 1806.56 ms
Request ID
d5192687-d8ef-41dc-bd2f-efbcb1c7ff6f
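The log says each attribute received a whole pandas Series where a `str` was expected: `df3['cases']` is the entire column, and `for row in df1` iterates over column names rather than rows. One possible direction, sketched below, is to walk the rows of df3 and build one item per row with every value converted to a string. The helper name `row_to_item` is made up for illustration, the two sample rows are copied from the log above, and the `put_item` call is left as a comment so the sketch runs without AWS access.

```python
import pandas as pd

def row_to_item(row):
    """Build a low-level DynamoDB item from one row (hypothetical helper).

    The log above shows that 'N' and 'S' values must be plain str.
    """
    return {
        "date": {"S": str(row.date)},
        "cases": {"N": str(row.cases)},
        "deaths": {"N": str(row.deaths)},
        "Recovered": {"N": str(row.Recovered)},
    }

# Stand-in for df3, using the first two rows shown in the function logs.
df3 = pd.DataFrame(
    {"date": ["2020-01-22", "2020-01-23"], "cases": [1, 1],
     "deaths": [0, 0], "Recovered": [0, 0]},
    index=[1, 2],
)

# itertuples yields one namedtuple per row, so each attribute is a scalar.
items = [row_to_item(row) for row in df3.itertuples(index=False)]
print(items[0])

# Inside the Lambda, each item would then be written on its own, e.g.:
# for item in items:
#     dyndb.put_item(TableName='covidstatstable', Item=item)
```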
I would like to sort a dataframe by certain priority rules.
I've achieved this in the code below but I think this is a very hacky solution.
Is there a more proper Pandas way of doing this?
import pandas as pd
import numpy as np
df=pd.DataFrame({"Primary Metric":[80,100,90,100,80,100,80,90,90,100,90,90,80,90,90,80,80,80,90,90,100,80,80,100,80],
"Secondary Metric Flag":[0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0],
"Secondary Value":[15, 59, 70, 56, 73, 88, 83, 64, 12, 90, 64, 18, 100, 79, 7, 71, 83, 3, 26, 73, 44, 46, 99,24, 20],
"Final Metric":[222, 883, 830, 907, 589, 93, 479, 498, 636, 761, 851, 349, 25, 405, 132, 491, 253, 318, 183, 635, 419, 885, 305, 258, 924]})
Primary_List=list(np.unique(df['Primary Metric']))
Primary_List.sort(reverse=True)
df_sorted=pd.DataFrame()
for p in Primary_List:
    lol=df[df["Primary Metric"]==p]
    lol.sort_values(["Secondary Metric Flag"],ascending = False)
    pt1=lol[lol["Secondary Metric Flag"]==1].sort_values(by=['Secondary Value', 'Final Metric'], ascending=[False, False])
    pt0=lol[lol["Secondary Metric Flag"]==0].sort_values(["Final Metric"],ascending = False)
    df_sorted=df_sorted.append(pt1)
    df_sorted=df_sorted.append(pt0)
df_sorted
The priority rules are:
- First sort by the 'Primary Metric', then by the 'Secondary Metric Flag'.
- If the 'Secondary Metric Flag' == 1, sort by 'Secondary Value', then by the 'Final Metric'.
- If it is == 0, go straight to the 'Final Metric'.
Appreciate any feedback.
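You do not need a for loop or groupby here. Split the rows by flag, sort each part, then concatenate. The final sort on 'Primary Metric' has to be descending and stable so that the order built in each part is kept:
df1=df.loc[df['Secondary Metric Flag']==1].sort_values(by=['Primary Metric','Secondary Value', 'Final Metric'], ascending=[False,False, False])
df0=df.loc[df['Secondary Metric Flag']==0].sort_values(["Primary Metric","Final Metric"],ascending = [False,False])
# mergesort is stable: within each 'Primary Metric' the flag==1 rows stay ahead of the flag==0 rows
df=pd.concat([df1,df0]).sort_values('Primary Metric', ascending=False, kind='mergesort')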
sorted with loc
def k(t):
    p, s, v, f = df.loc[t]
    return (-p, -s, -s * v, -f)
df.loc[sorted(df.index, key=k)]
Primary Metric Secondary Metric Flag Secondary Value Final Metric
9 100 1 90 761
5 100 1 88 93
1 100 1 59 883
3 100 1 56 907
23 100 1 24 258
20 100 0 44 419
13 90 1 79 405
19 90 1 73 635
7 90 1 64 498
11 90 1 18 349
10 90 0 64 851
2 90 0 70 830
8 90 0 12 636
18 90 0 26 183
14 90 0 7 132
15 80 1 71 491
21 80 1 46 885
17 80 1 3 318
24 80 0 20 924
4 80 0 73 589
6 80 0 83 479
22 80 0 99 305
16 80 0 83 253
0 80 0 15 222
12 80 0 100 25
sorted with itertuples
def k(t):
    _, p, s, v, f = t
    return (-p, -s, -s * v, -f)
idx, *tups = zip(*sorted(df.itertuples(), key=k))
pd.DataFrame(dict(zip(df, tups)), idx)
lexsort
p = df['Primary Metric']
s = df['Secondary Metric Flag']
v = df['Secondary Value']
f = df['Final Metric']
a = np.lexsort([
    -p, -s, -s * v, -f
][::-1])
df.iloc[a]
Construct New DataFrame
df.mul([-1, -1, 1, -1]).assign(
**{'Secondary Value': lambda d: d['Secondary Metric Flag'] * d['Secondary Value']}
).pipe(
lambda d: df.loc[d.sort_values([*d]).index]
)
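The same key used in the answers above, (primary, flag, flag * value, final), all descending, can also be expressed with a temporary helper column and a single `sort_values` call. This is only a sketch of that idea; the column name `_sv` is made up for illustration. Multiplying by the flag zeroes out 'Secondary Value' for flag == 0 rows, so those rows fall through to 'Final Metric'.

```python
import pandas as pd

df = pd.DataFrame({
    "Primary Metric": [80,100,90,100,80,100,80,90,90,100,90,90,80,90,90,80,80,80,90,90,100,80,80,100,80],
    "Secondary Metric Flag": [0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0],
    "Secondary Value": [15,59,70,56,73,88,83,64,12,90,64,18,100,79,7,71,83,3,26,73,44,46,99,24,20],
    "Final Metric": [222,883,830,907,589,93,479,498,636,761,851,349,25,405,132,491,253,318,183,635,419,885,305,258,924],
})

# '_sv' is a hypothetical helper column: the secondary value counts only when the flag is 1.
order = (
    df.assign(_sv=df["Secondary Metric Flag"] * df["Secondary Value"])
      .sort_values(
          ["Primary Metric", "Secondary Metric Flag", "_sv", "Final Metric"],
          ascending=False,
      )
      .index
)

# Re-index the original frame so the helper column never appears in the result.
df_sorted = df.loc[order]
print(df_sorted)
```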
I have been struggling for days trying to solve this puzzle.
I have this code that calculates the time between IN and OUT as decimal hours (6 min = 0.1 hr, up to 60 min = 1.0 hr):
NSUInteger unitFlag = NSCalendarUnitHour | NSCalendarUnitMinute;
NSDateComponents *components = [calendar components:unitFlag
fromDate:self.outT
toDate:self.inT
options:0];
NSInteger hours = [components hour];
NSInteger minutes = [components minute];
if (minutes <0) (minutes -= 60*-1) && (hours -=1);
if (hours<0 && minutes<0)(hours +=24)&& (minutes -=60*-1);
if(hours<0 && minutes>0)(hours +=24)&& (minutes = minutes);
if(hours <0 && minutes == 00)(hours +=24)&&(minutes = minutes);
if(minutes >0)(minutes = (minutes/6));
self.blockDecimalLabel.text = [NSString stringWithFormat:@"%d.%d", (int)hours, (int)minutes];
The green lines show what the code does. What I am looking for is to round the minutes as the blue lines show: 1 or 2 minutes past a multiple of 6 round down to the previous decimal hour, and 3, 4 or 5 minutes round up to the next decimal hour.
What I am trying to achieve is:
Currently, if the result is 11 minutes the code returns 0.1, and only at 12 minutes does it return 0.2. What I want is: if the result is 8 minutes the code returns 0.1, but if it is 9 it rounds up to the next decimal, 0.2, and so on. The objective is to avoid losing up to 5 minutes in each multiple of 6 in the worst case. Doing this, the maximum loss will be 3 minutes on average.
Any input is more than welcome :)
Cheers
Your goals seem incoherent to me. However, I tried this:
let beh = NSDecimalNumberHandler(
    roundingMode: .RoundPlain, scale: 1, raiseOnExactness: false,
    raiseOnOverflow: false, raiseOnUnderflow: false, raiseOnDivideByZero: false
)
for t in 0...60 {
    let div = Double(t)/60.0
    let deci = NSDecimalNumber(double: div)
    let deci2 = deci.decimalNumberByRoundingAccordingToBehavior(beh)
    let result = deci2.doubleValue
    println("min: \(t) deci: \(result)")
}
The output seems pretty much what you are asking for:
min: 0 deci: 0.0
min: 1 deci: 0.0
min: 2 deci: 0.0
min: 3 deci: 0.1
min: 4 deci: 0.1
min: 5 deci: 0.1
min: 6 deci: 0.1
min: 7 deci: 0.1
min: 8 deci: 0.1
min: 9 deci: 0.2
min: 10 deci: 0.2
min: 11 deci: 0.2
min: 12 deci: 0.2
min: 13 deci: 0.2
min: 14 deci: 0.2
min: 15 deci: 0.3
min: 16 deci: 0.3
min: 17 deci: 0.3
min: 18 deci: 0.3
min: 19 deci: 0.3
min: 20 deci: 0.3
min: 21 deci: 0.4
min: 22 deci: 0.4
min: 23 deci: 0.4
min: 24 deci: 0.4
min: 25 deci: 0.4
min: 26 deci: 0.4
min: 27 deci: 0.5
min: 28 deci: 0.5
min: 29 deci: 0.5
min: 30 deci: 0.5
min: 31 deci: 0.5
min: 32 deci: 0.5
min: 33 deci: 0.6
min: 34 deci: 0.6
min: 35 deci: 0.6
min: 36 deci: 0.6
min: 37 deci: 0.6
min: 38 deci: 0.6
min: 39 deci: 0.7
min: 40 deci: 0.7
min: 41 deci: 0.7
min: 42 deci: 0.7
min: 43 deci: 0.7
min: 44 deci: 0.7
min: 45 deci: 0.8
min: 46 deci: 0.8
min: 47 deci: 0.8
min: 48 deci: 0.8
min: 49 deci: 0.8
min: 50 deci: 0.8
min: 51 deci: 0.9
min: 52 deci: 0.9
min: 53 deci: 0.9
min: 54 deci: 0.9
min: 55 deci: 0.9
min: 56 deci: 0.9
min: 57 deci: 1.0
min: 58 deci: 1.0
min: 59 deci: 1.0
min: 60 deci: 1.0
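The table above amounts to rounding minutes/6 to the nearest whole tenth, with halves going up. With integers only, that is (minutes + 3) // 6. Here is a small sketch of that arithmetic in Python; the function names are made up for illustration, and it carries into the hour when the tenths reach 10 (57 to 59 minutes).

```python
def tenths_of_hour(minutes):
    """Round minutes to the nearest tenth of an hour (6-minute block), halves up."""
    return (minutes + 3) // 6

def decimal_hours(hours, minutes):
    """Format hours and minutes as decimal hours with one digit, e.g. '1.3'."""
    total_tenths = hours * 10 + tenths_of_hour(minutes)
    # Splitting the total handles the carry when minutes round up to a full hour.
    return f"{total_tenths // 10}.{total_tenths % 10}"

for m in (2, 3, 8, 9, 56, 57):
    print(m, decimal_hours(0, m))
```

Working in whole tenths avoids floating-point rounding altogether, which also sidesteps printing hours and minutes as two separate integers.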
I have a data file which contains two columns. The second column varies periodically, and its max and min are different in each period:
a 3
b 4
c 5
d 4
e 3
f 2
g 1
h 2
i 3
j 4
k 5
l 6
m 5
n 4
o 3
p 2
q 1
r 0
s 1
t 2
u 3
We can see that in the 1st period (from a to i): max = 5, min = 1. In the 2nd period (from i to u): max = 6, min = 0.
Using awk, I can print the max and min of the whole second column, but I cannot print the min and max for each period. I would like to obtain results like this:
period min max
1 1 5
2 0 6
Here is what I did :
{
    nb_lignes = 21
    period = 9
    nb_periodes = int(nb_lignes/period)
}
{
    for (j = 0; j <= nb_periodes; j++)
    {
        if (NR == (1 + period*j)) {{max=$2 ; min=$2}}
        for (i = (period*j); i <= (period*(j+1)); i++)
        {
            if (NR == i)
            {
                if ($2 >= max) {max = $2}
                if ($2 <= min) {min = $2}
                {print "Min: "min,"Max: "max,"Ligne: " NR}
            }
        }
    }
}
#END { print "Min: "min,"Max: "max }
However, the result is far from what I am looking for:
Min: 3 Max: 3 Ligne: 1
Min: 3 Max: 4 Ligne: 2
Min: 3 Max: 5 Ligne: 3
Min: 3 Max: 5 Ligne: 4
Min: 3 Max: 5 Ligne: 5
Min: 2 Max: 5 Ligne: 6
Min: 1 Max: 5 Ligne: 7
Min: 1 Max: 5 Ligne: 8
Min: 1 Max: 5 Ligne: 9
Min: 1 Max: 5 Ligne: 9
Min: 4 Max: 4 Ligne: 10
Min: 4 Max: 5 Ligne: 11
Min: 4 Max: 6 Ligne: 12
Min: 4 Max: 6 Ligne: 13
Min: 4 Max: 6 Ligne: 14
Min: 3 Max: 6 Ligne: 15
Min: 2 Max: 6 Ligne: 16
Min: 1 Max: 6 Ligne: 17
Min: 0 Max: 6 Ligne: 18
Min: 0 Max: 6 Ligne: 18
Min: 1 Max: 1 Ligne: 19
Min: 1 Max: 2 Ligne: 20
Min: 1 Max: 3 Ligne: 21
Thank you in advance for your help.
Try something like:
$ awk '
BEGIN{print "period", "min", "max"}
!f{min=$2; max=$2; ++f; next}
{max = ($2>max)?$2:max; min = ($2<min)?$2:min; f++}
f==9{print ++a, min, max; f=0}' file
period min max
1 1 5
2 0 6
When the counter f is zero (not set), assign the second column to both the max and min variables, increment the counter, and move on to the next line.
For every other line, check the second column. If it is bigger than max, assign it to max. Likewise, if it is smaller than min, assign it to min. Keep incrementing the counter.
Once the counter reaches 9, print the period number, min and max. Reset the counter to 0 and start afresh from the next line.
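For readers who find the counter logic easier to follow outside awk, the same idea can be sketched in Python: cut the second column into consecutive chunks of 9 lines and report min and max for each complete chunk. The function name `period_min_max` is made up for illustration, and trailing lines that do not fill a whole period are skipped, matching the awk output above.

```python
def period_min_max(values, period):
    """Return (period_number, min, max) for each complete chunk of `period` values."""
    results = []
    # Stop before a trailing partial chunk, like the awk script that prints only when f == 9.
    for start in range(0, len(values) - period + 1, period):
        chunk = values[start:start + period]
        results.append((start // period + 1, min(chunk), max(chunk)))
    return results

# Second column of the sample data (rows a to u).
values = [3, 4, 5, 4, 3, 2, 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3]

print("period min max")
for row in period_min_max(values, 9):
    print(*row)
```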
I've started, so I'll finish. I chose to create an array which contains the minimum and maximum for each period:
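awk -v period=9 '
BEGIN { print "period", "min", "max" }
NR % period == 1 { ++i }
# test for existence rather than truth, so a stored minimum of 0 is not overwritten
!(i in min) || $2 < min[i] { min[i] = $2 }
$2 > max[i] { max[i] = $2 }
# loop by number: the order of "for (i in min)" is not guaranteed
END { for (j = 1; j <= i; j++) print j, min[j], max[j] }' input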
The index i increases every period number of lines (in this case 9). If no value has been set yet or a new minimum/maximum has been found, update the array.
edit: if max[i] has not yet been set then $2 > max[i], so no need to check !max[i].
awk 'BEGIN{print "Period","min","max"}
# reset min/max on the first line of each 9-line period
NR%9==1{mi=ma=$2}
{$2<mi?mi=$2:0;$2>ma?ma=$2:0}
NR%9==0{print ++i,mi,ma}' your_file