I want to replace the NAs in each column with the per-country column mean using the code below, but I get the error message that follows. Any guidance is much appreciated.
mydata <- mydata %>%
  group_by(country) %>%
  mutate(across(c(battles, explosions, protests, riots, violenceac),
                ~ replace_na(.x, mean(.x, na.rm = TRUE)))) %>%
  ungroup()
Error in mutate():
! Problem while computing ..1 = across(...).
ℹ The error occurred in group 1: country = "Afghanistan".
Caused by error in across():
! Problem while computing column battles.
Caused by error in vec_assign():
! Can't convert from replace to data due to loss of precision.
• Locations: 1
head(mydata, 3)
     X month year       date     country code                     region              income battles
1:  17   Jan 2018 20.01.2018 Afghanistan  AFG                 South Asia          Low income     777
2:  66   Jan 2018 20.01.2018     Albania  ALB      Europe & Central Asia Upper middle income      NA
3: 114   Jan 2018 20.01.2018     Algeria  DZA Middle East & North Africa Lower middle income       4
    explosions protests riots violenceac   vaa polstab goveff   rol total_cases total_deaths stringency
1:         281        9     2         38 -1,08   -2,73  -1,52 -1,81        <NA>         <NA>       <NA>
2:          NA        5    NA         NA  0,09    0,08  -0,14 -0,36        <NA>         <NA>       <NA>
3:          NA       56    13          1  -1,1   -0,86  -0,53 -0,78
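For what it's worth, that message comes from vctrs: `replace_na()` is trying to write a double (the result of `mean()`) into an integer column such as `battles`, and R refuses the lossy conversion. Coercing the columns to double first, e.g. `~ replace_na(as.numeric(.x), mean(.x, na.rm = TRUE))`, is one way around it. For comparison, the same per-group mean imputation in pandas (made-up sample values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Albania", "Albania"],
    "battles": [777, np.nan, np.nan, 4],
})

# Fill each country's missing values with that country's own mean;
# transform("mean") broadcasts the group mean back to every row
df["battles"] = df["battles"].fillna(
    df.groupby("country")["battles"].transform("mean")
)
```

pandas sidesteps the precision issue because a column containing NaN is already float.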
I have an Excel column that consists of numbers and times that were all supposed to be entered as time values only. Some are in number form (915) and some are in time form (9:15, which appear as decimals in R). I managed to get them all into the same format in Excel (year-month-day hh:mm:ss), although the dates are incorrect - which doesn't really matter, I just need the time. However, I can't seem to convert this new column (time - new) back to the correct time value in R (in character or time format).
I'm sure this answer already exists somewhere, I just can't find one that works...
# Returns incorrect time (times() is from the chron package)
x$new_time <- times(strftime(x$`time - new`, "%H:%M:%S"))

# Returns all NA
x$new_time2 <- as.POSIXct(as.character(x$`time - new`),
                          format = '%H:%M:%S', origin = '2011-07-15 13:00:00')
> head(x)
# A tibble: 6 x 8
Year Month Day `Zone - matched with coordinate tab` Time `time - new` new_time new_time2
<dbl> <dbl> <dbl> <chr> <dbl> <dttm> <times> <dttm>
1 2017 7 17 Crocodile 103 1899-12-31 01:03:00 20:03:00 NA
2 2017 7 17 Crocodile 113 1899-12-31 01:13:00 20:13:00 NA
3 2017 7 16 Crocodile 118 1899-12-31 01:18:00 20:18:00 NA
4 2017 7 17 Crocodile 123 1899-12-31 01:23:00 20:23:00 NA
5 2017 7 17 Crocodile 125 1899-12-31 01:25:00 20:25:00 NA
6 2017 7 16 West 135 1899-12-31 01:35:00 20:35:00 NA
Found this answer here:
Extract time from timestamp?
library(lubridate)
# Adding new column to verify times are correct
x$works <- format(ymd_hms(x$`time - new`), "%H:%M:%S")
I want to reshape a dataframe from this shape
Country   subject    2017 2018 Frq 2017 Score  2018 Score
Argentina subject 1  12   22   100 50.77214238 51.54316539
Argentina subject 2  68   13   150 66.92805676 67.60645268
to this shape
           subject 1                                   subject 2
Country    2017 2018 Frq 2017 Score  2018 Score        2017 2018 Frq 2017 Score  2018 Score
Argentina  12   22   100 50.77214238 51.54316539       12   22   100 50.77214238 51.54316539
Australia  68   13   150 66.92805676 67.60645268       68   13   150 66.92805676 67.60645268
So each country has one row, and the values of the subject column are turned into columns.
I've tried the following, but nothing produced the required result:
pd.pivot_table(GCI, index='Country', columns=['subject'],
values=['2017', '2018'], aggfunc='sum', fill_value=0)
Also tried:
pivoted_GCI = GCI[['Country']]  # pd.DataFrame()
for key, group_df in GCI.groupby('subject'):
    print("the group for '{}' has {} rows".format(key, len(group_df)))
    group_df.name = key
    group_df = group_df.drop(['subject'], axis=1)
    display(group_df)
    pivoted_GCI = pd.merge(pivoted_GCI, group_df, on='Country', how='left')
Thanks
You can do it with unstack:
s = df.set_index(['Country', 'subject']).stack().unstack([1, 2]).reset_index()
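Applied to the sample rows (a minimal frame with just a subset of the columns), the chain pivots each subject's value columns up into a second column level, leaving one row per Country:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Argentina", "Argentina"],
    "subject": ["subject 1", "subject 2"],
    "2017": [12, 68],
    "2018": [22, 13],
    "Frq": [100, 150],
})

# stack() moves the value columns into the row index, then unstack([1, 2])
# lifts both the subject level and the column-name level up into the columns,
# producing a (subject, measure) MultiIndex
out = df.set_index(["Country", "subject"]).stack().unstack([1, 2]).reset_index()
```

out then has columns like ('subject 1', '2017'); if single-level names are needed afterwards, they can be flattened with something like out.columns = [' '.join(c).strip() for c in out.columns].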
I'm getting this error:
Error tokenizing data. C error: Expected 2 fields in line 11, saw 3
Code:
import pandas as pd
import webbrowser

website = 'https://en.wikipedia.org/wiki/Winning_percentage'
webbrowser.open(website)
league_frame = pd.read_clipboard()
The error above is raised by the read_clipboard() call.
I believe you need read_html - it returns all the parsed tables, and you select the DataFrame by position:
import pandas as pd

website = 'https://en.wikipedia.org/wiki/Winning_percentage'
#select first parsed table
df1 = pd.read_html(website)[0]
print (df1.head())
Win % Wins Losses Year Team Comment
0 0.798 67 17 1882 Chicago White Stockings best pre-modern season
1 0.763 116 36 1906 Chicago Cubs best 154-game NL season
2 0.721 111 43 1954 Cleveland Indians best 154-game AL season
3 0.716 116 46 2001 Seattle Mariners best 162-game AL season
4 0.667 108 54 1975 Cincinnati Reds best 162-game NL season
#select second parsed table
df2 = pd.read_html(website)[1]
print (df2)
Win % Wins Losses Season Team \
0 0.890 73 9 2015–16 Golden State Warriors
1 0.110 9 73 1972–73 Philadelphia 76ers
2 0.106 7 59 2011–12 Charlotte Bobcats
Comment
0 best 82 game season
1 worst 82-game season
2 worst season statistically
I have this result
ZONE SITE BRAND VALUE
north a a_brand1 10
north a a_brand2 15
north a a_brand3 27
south b b_brand1 17
south b b_brand2 5
south b b_brand3 56
Is there any way to add a column with the sum grouped by zone and site, like this? Total for site a = 10+15+27 = 52 and total for site b = 17+5+56 = 78.
ZONE SITE BRAND VALUE TOTAL_IN_SITE
north a a_brand1 10 52
north a a_brand2 15 52
north a a_brand3 27 52
south b b_brand1 17 78
south b b_brand2 5 78
south b b_brand3 56 78
Thanks.
Use the sum window function:
select t.*, sum(value) over (partition by zone, site) as total_in_site
from tbl t
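Run against the sample rows, the window function repeats the per-group total on every row, as wanted. Here it is executed with SQLite via Python, purely to demonstrate (any engine that supports window functions behaves the same):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl (zone TEXT, site TEXT, brand TEXT, value INTEGER)")
con.executemany("INSERT INTO tbl VALUES (?, ?, ?, ?)", [
    ("north", "a", "a_brand1", 10),
    ("north", "a", "a_brand2", 15),
    ("north", "a", "a_brand3", 27),
    ("south", "b", "b_brand1", 17),
    ("south", "b", "b_brand2", 5),
    ("south", "b", "b_brand3", 56),
])

# SUM(...) OVER (PARTITION BY ...) computes one total per (zone, site)
# group and attaches it to each row, without collapsing the rows
rows = con.execute(
    "SELECT t.*, SUM(value) OVER (PARTITION BY zone, site) AS total_in_site "
    "FROM tbl t"
).fetchall()
```

Unlike a GROUP BY, the detail rows (brand, value) are all preserved alongside the group total.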
I am using Keras to create an LSTM neural network that can predict the concentration of a certain drug in the blood. I have a dataset with time stamps at which a drug dose was administered and at which the blood concentration was measured. These dosage and measurement time stamps are disjoint. Furthermore, several other variables are measured at all time steps (both dosage and measurement). These variables, along with the dosages (0 when no dosage was given at time t), are the input for my model. The observed blood concentration is the response variable.
I have normalized all input features using the MinMaxScaler().
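For reference, MinMaxScaler with its default feature_range=(0, 1) rescales each feature column independently to [0, 1]; a minimal numpy sketch of that step (dosage/rate values made up for illustration):

```python
import numpy as np

# Two hypothetical feature columns: dosage and dosage rate
X = np.array([[100.0, 12.0],
              [90.0,   8.0],
              [120.0, 13.5]])

# Per-column min-max scaling: (x - min) / (max - min),
# which is what MinMaxScaler's fit_transform computes
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```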
Q1:
Now I am wondering, do I need to normalize the time variable that corresponds with all rows as well and give it as input to the model? Or can I leave this variable out since the time steps are equally spaced?
The data looks like:
PatientID Time Dosage DosageRate DrugConcentration
1 0 100 12 NA
1 309 100 12 NA
1 650 100 12 NA
1 1030 100 12 NA
1 1320 NA NA 12
1 1405 100 12 NA
1 1812 90 8 NA
1 2078 90 8 NA
1 2400 NA NA 8
2 0 120 13.5 NA
2 800 120 13.5 NA
2 920 NA NA 16
2 1515 120 13.5 NA
2 1832 120 13.5 NA
2 2378 120 13.5 NA
2 2600 120 13.5 NA
2 3000 120 13.5 NA
2 4400 NA NA 2
As you can see, the time between two consecutive dosages and measurements differs for a patient and between patients, which makes the problem difficult.
Q2:
One approach I can think of is aggregating over measurement intervals and taking the average dosage and SD between two measurements. Then we only predict at time stamps for which we know the observed drug concentration. Would this work, or would we lose too much information?
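A sketch of that interval-aggregation idea, assuming a pandas frame shaped like the sample above (column names taken from the table; the `interval` helper column is something I introduce here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "PatientID": [1, 1, 1, 1, 1],
    "Time": [0, 309, 650, 1030, 1320],
    "Dosage": [100, 100, 100, 100, np.nan],
    "DrugConcentration": [np.nan, np.nan, np.nan, np.nan, 12],
})

# Each measurement row closes an interval: number intervals by counting the
# measurements seen so far, shifted so the measurement row itself stays
# grouped with the dosage rows that precede it
is_meas = df["DrugConcentration"].notna()
df["interval"] = is_meas.shift(fill_value=False).cumsum()

agg = (df.groupby(["PatientID", "interval"])
         .agg(mean_dosage=("Dosage", "mean"),
              sd_dosage=("Dosage", "std"),
              concentration=("DrugConcentration", "last"))
         .reset_index())
```

This yields one training row per measurement, with the dosage summarized over the preceding interval, which is exactly the information loss being asked about.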
Q3:
A second approach I could think of is to create new data points so that all intervals between dosages are the same, and to set the dosage and dosage rate at those added time points to zero. The disadvantage is that we can then only calculate the error at the time stamps for which we know the observed drug concentration. How should we tackle this?