This is what my data looks like:
ID        Phase        Phase Datet  Phase Temp  TempRecorded
23300200  Induction    2016-10-07   30          2016-10-07
23300200  Maintenance  2016-10-07   35          2016-10-07
23300200  Rewarming    2016-10-07   35.5        2016-10-07
I want to have it look like this:
## ID ## ## Induction Phase ## ## Induction Phase Datet ## ## Induction Phase Temp ## ## Induction TempRecorded ## ## ID ## ## Maintenance Phase ## ## Maintenance Phase Datet ## ## Maintenance Phase Temp ## ## Maintenance TempRecorded ##
23300200 Induction 2016-10-07 30 2016-10-07 Maintenance 2016-10-07 35 2016-10-07
My colleague provided me with a solution:
SELECT I.*, M.*, R.*
FROM (SELECT * FROM #PhaseTemp WHERE ValueText = 'Induction') I
LEFT JOIN (SELECT * FROM #PhaseTemp WHERE ValueText = 'Maintenance') M
    ON M.ClientGUID = I.ClientGUID
LEFT JOIN (SELECT * FROM #PhaseTemp WHERE ValueText = 'Rewarming') R
    ON R.ClientGUID = I.ClientGUID
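The self-join repeats every column once per phase. As a rough sketch of an alternative, conditional aggregation returns one row per ID with one column per phase. The example below runs on an in-memory SQLite database through Python's sqlite3 module, not on SQL Server. The table name phase_temp and the column names id, phase and temp are assumptions made for illustration, and only the temperature is pivoted to keep it short.

```python
import sqlite3

# In-memory stand-in for the temp table; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phase_temp (id INTEGER, phase TEXT, temp REAL)")
conn.executemany(
    "INSERT INTO phase_temp VALUES (?, ?, ?)",
    [
        (23300200, "Induction", 30),
        (23300200, "Maintenance", 35),
        (23300200, "Rewarming", 35.5),
    ],
)

# One CASE expression per phase; MAX() picks the single non-NULL value per id.
query = """
SELECT id,
       MAX(CASE WHEN phase = 'Induction'   THEN temp END) AS induction_temp,
       MAX(CASE WHEN phase = 'Maintenance' THEN temp END) AS maintenance_temp,
       MAX(CASE WHEN phase = 'Rewarming'   THEN temp END) AS rewarming_temp
FROM phase_temp
GROUP BY id
"""
pivoted = conn.execute(query).fetchall()
print(pivoted)
```

The same MAX(CASE WHEN ... END) pattern is standard SQL, so it should carry over to SQL Server, but the phase list has to be known in advance.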
My data is like this:
year_month  user_id  pageviews  visits
2020-03     2        8          3
2021-03     27       4          3
2021-05     23       75         7
2020-05     23       17         7
2020-08     339      253        169
2020-08     892      31         4
2021-08     339      4          3
I want to group by month and calculate the difference in pageviews and visits from one year (2020) to the next (2021).
The output should look something like this (without the content inside the parentheses):
last_month  diff(pageviews)  diff(visits)
2021-03     -4(4-8)          0(3-3)
2021-05     58(75-17)        0(7-7)
2021-08     -280(4-284)      -170(3-173)
I'm not sure how to do this in a vectorized way. I was thinking of converting the data to pandas and using a for loop, but I would like to learn how to do this kind of thing with PySpark or Spark SQL, which I expect to be much faster.
The main idea is to use a window function to compare the same month across years. See the comments in the code for details.
from pyspark.sql import functions as F
from pyspark.sql import Window as W
(df
    # year and month have to be compared separately,
    # so split year_month into two integer columns
    .withColumn('year', F.split('year_month', '-')[0].cast('int'))
    .withColumn('month', F.split('year_month', '-')[1].cast('int'))

    # there can be several rows per year_month,
    # so group them and sum the values
    .groupBy('year', 'month')
    .agg(
        F.sum('pageviews').alias('pageviews'),
        F.sum('visits').alias('visits')
    )

    # to compare each 2021 month with the same month in 2020, use the lag
    # window function; partitioning by month keeps lag from picking up a
    # different month when one year has no data for that month
    .withColumn('prev_pageviews', F.lag('pageviews').over(W.partitionBy('month').orderBy('year')))
    .withColumn('prev_visits', F.lag('visits').over(W.partitionBy('month').orderBy('year')))

    # with current and previous pageviews/visits on the same row,
    # the difference is a simple subtraction
    .withColumn('diff_pageviews', F.col('pageviews') - F.col('prev_pageviews'))
    .withColumn('diff_visits', F.col('visits') - F.col('prev_visits'))

    # select only the necessary columns and rows
    .select('year', 'month', 'diff_pageviews', 'diff_visits')
    .where(F.col('year') == 2021)
    .orderBy('month')
    .show()
)
# Output
# +----+-----+--------------+-----------+
# |year|month|diff_pageviews|diff_visits|
# +----+-----+--------------+-----------+
# |2021|    3|            -4|          0|
# |2021|    5|            58|          0|
# |2021|    8|          -280|       -170|
# +----+-----+--------------+-----------+
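Since the question also mentions Spark SQL, here is a rough sketch of the same comparison in plain SQL, using a self-join on the monthly totals instead of a window function. It runs on an in-memory SQLite database through Python's sqlite3 module as a stand-in. The table name traffic is an assumption, and the query has not been run on Spark, although substr, CAST and LIKE exist there too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE traffic "
    "(year_month TEXT, user_id INTEGER, pageviews INTEGER, visits INTEGER)"
)
conn.executemany(
    "INSERT INTO traffic VALUES (?, ?, ?, ?)",
    [
        ("2020-03", 2, 8, 3),
        ("2021-03", 27, 4, 3),
        ("2021-05", 23, 75, 7),
        ("2020-05", 23, 17, 7),
        ("2020-08", 339, 253, 169),
        ("2020-08", 892, 31, 4),
        ("2021-08", 339, 4, 3),
    ],
)

# Step 1 (CTE): sum pageviews and visits per year_month.
# Step 2: join each 2021 month to the same month of the previous year.
query = """
WITH monthly AS (
    SELECT year_month,
           SUM(pageviews) AS pageviews,
           SUM(visits)    AS visits
    FROM traffic
    GROUP BY year_month
)
SELECT cur.year_month,
       cur.pageviews - prev.pageviews AS diff_pageviews,
       cur.visits    - prev.visits    AS diff_visits
FROM monthly AS cur
JOIN monthly AS prev
  ON substr(prev.year_month, 6, 2) = substr(cur.year_month, 6, 2)
 AND CAST(substr(prev.year_month, 1, 4) AS INTEGER)
     = CAST(substr(cur.year_month, 1, 4) AS INTEGER) - 1
WHERE cur.year_month LIKE '2021-%'
ORDER BY cur.year_month
"""
diffs = conn.execute(query).fetchall()
for row in diffs:
    print(row)
```

Because this is an inner join, a 2021 month with no matching 2020 month is dropped from the result rather than shown with a NULL difference.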
I know about the duplicated() function in base R. The problem is that it only returns a logical vector indicating which elements (rows) are duplicates.
I want to get back the rows themselves, not just a logical vector.
Specifically, I want all observations for A and B, because each of them has duplicated values for the key Name and Year.
I have already written this:
df %>% group_by(Name) %>% filter(any(( ?????)))
but I don't know how to write the last part of the code.
Any ideas?
Thanks :)
One option with dplyr is to group by both Name and Year and count the rows in each group, then group by Name only and keep the groups in which any count is greater than 1 (meaning a duplicate):
library(dplyr)

df %>%
  group_by(Name, Year) %>%
  mutate(count = n()) %>%
  group_by(Name) %>%
  filter(any(count > 1)) %>%
  select(-count)
# # A tibble: 7 x 3
# # Groups: Name [2]
# Name Year Value
# <chr> <int> <int>
# 1 A 1990 5
# 2 A 1990 3
# 3 A 1991 5
# 4 A 1995 5
# 5 B 2000 0
# 6 B 2000 4
# 7 B 1998 5
Data:
df <- read.table(text =
"Name Year Value
A 1990 5
A 1990 3
A 1991 5
A 1995 5
B 2000 0
B 2000 4
B 1998 5
C 1890 3
C 1790 2",
header = TRUE, stringsAsFactors = FALSE)
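To make the two-step logic explicit (count per Name and Year, then keep every row of any Name that has a repeated pair), here is a minimal sketch in plain Python using the same sample data. It is not a replacement for the dplyr pipeline, only an illustration of what the grouping does; the variable names are made up for the example.

```python
from collections import Counter

# Same rows as the sample data: (Name, Year, Value)
rows = [
    ("A", 1990, 5), ("A", 1990, 3), ("A", 1991, 5), ("A", 1995, 5),
    ("B", 2000, 0), ("B", 2000, 4), ("B", 1998, 5),
    ("C", 1890, 3), ("C", 1790, 2),
]

# Step 1: count how often each (Name, Year) pair occurs.
pair_counts = Counter((name, year) for name, year, _ in rows)

# Step 2: a Name qualifies if any of its (Name, Year) pairs occurs more than once.
names_with_dups = {name for (name, _), n in pair_counts.items() if n > 1}

# Step 3: keep all rows of the qualifying names, duplicated or not.
kept = [row for row in rows if row[0] in names_with_dups]
for row in kept:
    print(row)
```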
I use R to analyze stock data stored in an SQLite database. Currently the stock data is daily and stacked like this:
Code,Date,Price
A,2013-05-01,100
A,2013-05-02,102
A,2013-05-03,101
...
B,2013-05-01,53
B,2013-05-02,55
B,2013-05-03,56
...
C,2013-05-02,56
C,2013-05-03,51
...
I need to transform the stacked data into a wide layout like this:
Date,A,B,C,...
2013-05-01,100,53,NULL,...
2013-05-02,102,55,56,...
2013-05-03,101,56,51,...
Is there a good way to transform the data by SQL or in R?
Here are two options in R, one with base R and one with the "reshape2" package. Assuming you have read your data into an R data.frame called "mydf":
reshape(mydf, idvar="Date", timevar="Code", direction = "wide")
# Date Price.A Price.B Price.C
# 1 2013-05-01 100 53 NA
# 2 2013-05-02 102 55 56
# 3 2013-05-03 101 56 51
library(reshape2)
dcast(mydf, Date ~ Code, value.var="Price")
# Date A B C
# 1 2013-05-01 100 53 NA
# 2 2013-05-02 102 55 56
# 3 2013-05-03 101 56 51
This creates a zoo time series object with one column per stock:
library(RMySQL)
library(zoo)
# read table from database
con <- dbConnect(MySQL(), ...whatever...)
DF <- dbGetQuery(con, "select * from stocks")
# reshape it and create a multivariate time series from it
z <- read.zoo(DF, split = 1, index = 2)
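The question also asks whether this can be done in SQL. SQLite has no PIVOT operator, so one possible approach is conditional aggregation with one CASE expression per stock code, generated from the distinct codes. The sketch below uses Python's sqlite3 module and an in-memory table; the table name stocks follows the query above, while the helper quote_ident is made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stocks (Code TEXT, Date TEXT, Price REAL)")
conn.executemany(
    "INSERT INTO stocks VALUES (?, ?, ?)",
    [
        ("A", "2013-05-01", 100), ("A", "2013-05-02", 102), ("A", "2013-05-03", 101),
        ("B", "2013-05-01", 53), ("B", "2013-05-02", 55), ("B", "2013-05-03", 56),
        ("C", "2013-05-02", 56), ("C", "2013-05-03", 51),
    ],
)


def quote_ident(name):
    """Quote a value for use as a column alias (hypothetical helper)."""
    return '"' + name.replace('"', '""') + '"'


# Collect the stock codes, then build one MAX(CASE ...) column per code.
codes = [r[0] for r in conn.execute("SELECT DISTINCT Code FROM stocks ORDER BY Code")]
select_cols = ", ".join(
    f"MAX(CASE WHEN Code = ? THEN Price END) AS {quote_ident(c)}" for c in codes
)
query = f"SELECT Date, {select_cols} FROM stocks GROUP BY Date ORDER BY Date"

# The codes are passed as bound parameters; only the aliases are interpolated.
wide = conn.execute(query, codes).fetchall()
for row in wide:
    print(row)
```

Dates on which a stock has no price come back as NULL (None in Python), which matches the NULL in the desired output.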
I'll try and explain how the table is laid out so that what I need might be a bit more clear.
###############################################################
# cid # iid # child cid # child iid # target cid # target iid #
###############################################################
# 112 # 1 # null # null # 116 # 1 #
# 112 # 2 # 112 # 1 # null # null #
# 112 # 3 # 112 # 1 # 116 # 2 #
# 112 # 4 # 112 # 1 # 100 # 3 #
# 112 # 101 # null # null # 116 # 101 #
# 112 # 102 # 112 # 101 # null # null #
# 112 # 103 # 112 # 101 # 116 # 102 #
# 112 # 201 # null # null # 116 # 201 #
# 112 # 202 # 112 # 201 # null # null #
# 112 # 203 # 112 # 201 # 116 # 202 #
# 112 # 301 # null # null # 116 # 301 #
# 112 # 302 # 112 # 301 # null # null #
# 112 # 302 # 112 # 301 # 116 # 302 #
Above is a cut-down representation of the table I'm trying to get data from. Sorry if the layout is a bit crap. Each row here is an object, and each object can have a child object. For example, the first row has no child object but is linked to a target object. Row two has a child object and isn't linked to a target object; however, it is linked back to row one via the child cid and iid, and row one does have a target object. Row three is also linked to row one, but it has its own target object, so I don't want to go back to row one.
Other table
#########################################
# cid # iid # col1 # col2 # col3 # col4 #
#########################################
# 116 # 1 # a # null # 16 # 1 #
# 116 # 2 # b # 1 # 6 # null #
# 116 # 3 # n # 1 # 11 # 2 #
# 116 # 101 # n # 2 # 61 # 3 #
# 116 # 102 # b # null # 161 # 101 #
# 116 # 201 # a # 33 # 312 # 116 #
# 116 # 202 # a # 33 # 312 # 116 #
# 116 # 301 # s # 56 # 1321 # 33 #
# 116 # 302 # r # 6 # 22 # 12 #
Resulting table
###########################################################################################
# cid # iid # child cid # child iid # target cid # target iid # col1 # col2 # col3 # col4 #
###########################################################################################
# 112 # 1 # null # null # 116 # 1 # a # null # 16 # 1 #
# 112 # 2 # 112 # 1 # null # null # a # null # 16 # 1 #
# 112 # 3 # 112 # 1 # 116 # 2 # b # 1 # 6 # null #
# 112 # 4 # 112 # 1 # 100 # 3 # n # 1 # 11 # 2 #
# 112 # 101 # null # null # 116 # 101 # n # 2 # 61 # 3 #
# 112 # 102 # 112 # 101 # null # null # n # 2 # 61 # 3 #
# 112 # 103 # 112 # 101 # 116 # 102 # b # null # 161 # 101 #
# 112 # 201 # null # null # 116 # 201 # a # 33 # 312 # 116 #
# 112 # 202 # 112 # 201 # null # null # a # 33 # 312 # 116 #
# 112 # 203 # 112 # 201 # 116 # 202 # a # 33 # 312 # 116 #
# 112 # 301 # null # null # 116 # 301 # s # 56 # 1321 # 33 #
# 112 # 302 # 112 # 301 # null # null # s # 56 # 1321 # 33 #
# 112 # 302 # 112 # 301 # 116 # 302 # r # 6 # 22 # 12 #
[Just to clarify: in the first table, target cid and iid refer to cid and iid in the other table I'm linking to.]
Essentially what I need is to recursively go back through the table until a row has a target object reference.
If a row has both a child c/i id and a target c/i id, I just want the target c/i id.
Can anybody point me in the right direction?
I'm slowly reading through http://docs.oracle.com/cd/B19306_01/server.102/b14200/queries003.htm but I'm finding it a bit confusing. I wouldn't exactly be an expert in the easier SQL queries so recursion is a bit over my head right now.
Thanks
EDIT: Added example of other table and outcome
I don't know exactly what you need, but you could start with this statement:
select cid, iid, level, connect_by_root(target_cid), connect_by_root(target_iid)
from tab
connect by prior cid = child_cid
       and prior iid = child_iid
       and target_cid is null
;
and then filter for the entries you need:
select *
from (
    select cid, iid, level,
           connect_by_root(target_cid) as target_cid,
           connect_by_root(target_iid) as target_iid
    from tab
    connect by prior cid = child_cid
           and prior iid = child_iid
           and target_cid is null
)
where target_cid is not null
;
CID  IID  TARGET_CID  TARGET_IID
++++++++++++++++++++++++++++++++
112  1    116         1
112  2    116         1
112  3    116         2
112  4    100         3
112  101  116         101
112  102  116         101
112  103  116         102
112  201  116         201
112  202  116         201
112  203  116         202
112  301  116         301
112  302  116         301
112  302  116         302
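The same walk can also be sketched as a recursive common table expression: start from every row, and as long as a row has no target, follow its child_cid/child_iid reference to the next row. The example below is only an illustration on an in-memory SQLite database via Python's sqlite3 module, loaded with the rows from the question. Oracle 11gR2 and later accept a similar recursive WITH clause (without the RECURSIVE keyword), but this query has not been run against Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tab (cid INTEGER, iid INTEGER, child_cid INTEGER, "
    "child_iid INTEGER, target_cid INTEGER, target_iid INTEGER)"
)
conn.executemany(
    "INSERT INTO tab VALUES (?, ?, ?, ?, ?, ?)",
    [
        (112, 1, None, None, 116, 1),
        (112, 2, 112, 1, None, None),
        (112, 3, 112, 1, 116, 2),
        (112, 4, 112, 1, 100, 3),
        (112, 101, None, None, 116, 101),
        (112, 102, 112, 101, None, None),
        (112, 103, 112, 101, 116, 102),
        (112, 201, None, None, 116, 201),
        (112, 202, 112, 201, None, None),
        (112, 203, 112, 201, 116, 202),
        (112, 301, None, None, 116, 301),
        (112, 302, 112, 301, None, None),
        (112, 302, 112, 301, 116, 302),
    ],
)

# walk keeps the starting row (cid, iid) and the reference still to be followed.
# Recursion continues only while no target has been found yet.
query = """
WITH RECURSIVE walk(cid, iid, next_cid, next_iid, target_cid, target_iid) AS (
    SELECT cid, iid, child_cid, child_iid, target_cid, target_iid
    FROM tab
    UNION ALL
    SELECT w.cid, w.iid, t.child_cid, t.child_iid, t.target_cid, t.target_iid
    FROM walk AS w
    JOIN tab AS t
      ON t.cid = w.next_cid AND t.iid = w.next_iid
    WHERE w.target_cid IS NULL
)
SELECT cid, iid, target_cid, target_iid
FROM walk
WHERE target_cid IS NOT NULL
ORDER BY cid, iid, target_cid, target_iid
"""
resolved = conn.execute(query).fetchall()
for row in resolved:
    print(row)
```

The resolved target columns can then be joined to the other table on its cid and iid to pick up col1 to col4.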
If there is a name column like this:
name
prashant
ram
then the column values should become like this
name
##############################
# Name | Replaced_value #
##############################
# prashant | XXXXXXXX #
# | #
# ram | XXX #
##############################
Each value has to be replaced by the same number of Xs.
You can combine LPAD/RPAD and LENGTH:
LPAD('X',LENGTH(InputString),'X')
This will work:
select name,
       substr(lpad(name, length(name) + length(name), 'X'), 1, length(name)) as replaced_name
from table
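Both answers rely on the same idea: build a string of Xs whose length equals the length of the input. As a small illustration outside the database, the same rule in Python is a string repetition; the function name mask is made up for this sketch.

```python
def mask(value, fill="X"):
    """Return a string of `fill` characters with the same length as `value`."""
    return fill * len(value)


names = ["prashant", "ram"]
masked = {name: mask(name) for name in names}
for name, replaced in masked.items():
    print(name, replaced)
```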