select st_date, st_symbol, st_adjclose from table;

st_date     st_symbol  st_adjclose
2019-02-19  MSFT       107.71
2019-02-20  MSFT       107.15
2019-03-01  AAPL       174.97
2019-01-02  AAPL       157.25
2019-01-08  AMZN       1656.58
2019-01-07  AMZN       1629.51
2019-01-03  GOOGL      1025.47
2019-01-04  GOOGL      1078.07

Desired output, one column per symbol:

date        MSFT    AAPL    AMZN    GOOGL
2019-02-19  107.71  141.58  141.58  141.58
2019-02-20  107.71  157.25  157.25  157.25
2019-02-22  110.97  157.25  157.25  157.25
If you know in advance which stock symbols you want in your query, then you can do it using MAP:
SELECT st_date,
       SUM(t.value_map['MSFT'])  AS MSFT,
       SUM(t.value_map['AAPL'])  AS AAPL,
       SUM(t.value_map['AMZN'])  AS AMZN,
       SUM(t.value_map['GOOGL']) AS GOOGL
FROM (
    SELECT st_date,
           map(st_symbol, st_adjclose) AS value_map
    FROM (
        SELECT st_date,
               st_symbol,
               st_adjclose
        FROM TABLE
    ) x
) t
GROUP BY st_date;
Let me know if it works.
I have the following MultiIndex and I am trying to get the last time entry for a slice of it.
df.loc['AUDCAD'][-1]
would return 2019-04-30 00:00:00
and
df.loc['USDCHF'][-1]
would return 2021-03-05 23:55:00
open high low close
AUDCAD 2018-12-31 00:00:00 0.95708 0.96276 0.95649 0.95979
2019-01-31 00:00:00 0.96039 0.96309 0.92200 0.94895
2019-02-28 00:00:00 0.94849 0.95800 0.93185 0.93655
2019-03-31 00:00:00 0.93718 0.95632 0.93160 0.94745
2019-04-30 00:00:00 0.94998 0.96147 0.94150 0.94750
USDCHF 2021-03-05 23:35:00 0.93109 0.93119 0.93108 0.93116
2021-03-05 23:40:00 0.93116 0.93150 0.93116 0.93143
2021-03-05 23:45:00 0.93143 0.93147 0.93127 0.93128
2021-03-05 23:50:00 0.93129 0.93134 0.93117 0.93126
2021-03-05 23:55:00 0.93126 0.93141 0.93114 0.93118
I guess you're looking for:
df.loc[block_name].index[-1]
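A minimal sketch of that on a toy MultiIndex frame (the frame and the block labels are just placeholders standing in for your OHLC data):

import pandas as pd

# toy two-level index: (block, timestamp)
idx = pd.MultiIndex.from_tuples(
    [('AUDCAD', pd.Timestamp('2019-03-31')),
     ('AUDCAD', pd.Timestamp('2019-04-30')),
     ('USDCHF', pd.Timestamp('2021-03-05 23:50:00')),
     ('USDCHF', pd.Timestamp('2021-03-05 23:55:00'))])
df = pd.DataFrame({'close': [0.94745, 0.94750, 0.93126, 0.93118]}, index=idx)

# df.loc['AUDCAD'] drops the first level, so .index[-1] is the last timestamp
print(df.loc['AUDCAD'].index[-1])   # 2019-04-30 00:00:00
print(df.loc['USDCHF'].index[-1])   # 2021-03-05 23:55:00

Note that the positional .index[-1] returns the last row in storage order; if the slice is not sorted by time, .index.max() is the safer choice.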
I am trying to sum the values of a column where the values of another column match. Below is a sample of my data.
DT No_of_records LD_VOY_N LD_VSL_M
2017-05-06 04:00:00.000 7 0002W pqo emzmnwp
2017-05-06 20:00:00.000 6 0002W pqo emzmnwp
2017-05-02 04:00:00.000 1 0007E omq ynzmeoyn
2017-05-01 08:00:00.000 2 0016W rmhp sunhpnw
2017-05-01 12:00:00.000 1 0016W rmhp sunhpnw
2017-05-05 12:00:00.000 2 0019N omq wqmsy
2017-05-06 04:00:00.000 12 0019N omq wqmsy
Below is my desired output
DT No_of_records LD_VOY_N LD_VSL_M Total_no_of_records
2017-05-06 04:00:00.000 7 0002W pqo emzmnwp 13
2017-05-06 20:00:00.000 6 0002W pqo emzmnwp 13
2017-05-02 04:00:00.000 1 0007E omq ynzmeoyn 1
2017-05-01 08:00:00.000 2 0016W rmhp sunhpnw 3
2017-05-01 12:00:00.000 1 0016W rmhp sunhpnw 3
2017-05-05 12:00:00.000 2 0019N omq wqmsy 14
2017-05-06 04:00:00.000 12 0019N omq wqmsy 14
I am trying to find the Total_no_of_records column. Do you have any ideas?
You seem to want a window function by LD_VOY_N:
select t.*,
       sum(No_of_records) over (partition by LD_VOY_N) as Total_no_of_records
from t;
Alternatively, without a window function, aggregate per voyage and join the total back on:

select t.DT, t.No_of_records, t.LD_VOY_N, t.LD_VSL_M, s.Total_no_of_records
from tablename t
inner join (select LD_VOY_N, sum(No_of_records) as Total_no_of_records
            from tablename
            group by LD_VOY_N) s
    on s.LD_VOY_N = t.LD_VOY_N;
I have the following data
id starting_point ending_point Date
A 2525 6565 25/05/2017 13:25:00
B 5656 8989 25/01/2017 10:55:00
A 1234 5656 20/05/2017 03:20:00
A 4562 6245 01/02/2017 19:45:00
B 6496 9999 06/12/2016 21:55:00
B 1122 2211 20/03/2017 18:30:00
How do I group the data by id, in ascending order of date, and find the sum of the first starting_point and the last ending_point? In this case, the expected output is:
id starting_point ending_point Date Value
A 4562 6245 01/02/2017 19:45:00
A 1234 5656 20/05/2017 03:20:00
A 2525 6565 25/05/2017 13:25:00 4562 + 6565 = 11127
B 6496 9999 06/12/2016 21:55:00
B 1122 2211 20/03/2017 18:30:00 6496 + 2211 = 8707
IIUC:
In [146]: x.groupby('id').apply(lambda df: df['starting_point'].head(1).values[0]
     ...:                                  + df['ending_point'].tail(1).values[0])
Out[146]:
id
A 8770
B 7867
dtype: int64
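Note that head(1)/tail(1) pick the first and last rows in the frame's existing order; to match the expected output you would sort by Date first. A sketch of an equivalent with named aggregation (assumes the frame is called x as above and pandas >= 0.25):

import pandas as pd

x['Date'] = pd.to_datetime(x['Date'], dayfirst=True)   # dates are dd/mm/yyyy
out = (x.sort_values('Date')
        .groupby('id')
        .agg(first_start=('starting_point', 'first'),
             last_end=('ending_point', 'last')))
out['Value'] = out['first_start'] + out['last_end']
# Value: A -> 4562 + 6565 = 11127, B -> 6496 + 2211 = 8707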
When I extract data for multiple stocks using web.DataReader, I'm getting a panel as the output.
import numpy as np
import pandas as pd
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import datetime as dt
import re
startDate = '2010-01-01'
endDate = '2016-09-07'
stocks_query = ['AAPL','OPK']
stocks = web.DataReader(stocks_query, data_source='yahoo',
                        start=startDate, end=endDate)
stocks = stocks.swapaxes('items','minor_axis')
Leading to an output of:
Dimensions: 2 (items) x 1682 (major_axis) x 6 (minor_axis)
Items axis: AAPL to OPK
Major_axis axis: 2010-01-04 00:00:00 to 2016-09-07 00:00:00
Minor_axis axis: Open to Adj Close
A single dataframe of the panel looks like this
stocks['OPK']
Open High Low Close Volume Adj Close log_return \
Date
2010-01-04 1.80 1.97 1.76 1.95 234500.0 1.95 NaN
2010-01-05 1.64 1.95 1.64 1.93 135800.0 1.93 -0.010309
2010-01-06 1.90 1.92 1.77 1.79 546600.0 1.79 -0.075304
2010-01-07 1.79 1.94 1.76 1.92 138700.0 1.92 0.070110
2010-01-08 1.92 1.94 1.86 1.89 62500.0 1.89 -0.015748
I plan to do a lot of data manipulation across all dataframes: adding new columns, comparing two dataframes, etc. I was recommended to look into multi-indexing, as panels are being deprecated.
This is my first time working with panels.
If I want to add a new column to both dataframes (AAPL, OPK), I have to do something like this:
for i in stocks:
    stocks[i]['log_return'] = np.log(stocks[i]['Close'] / stocks[i]['Close'].shift(1))
If multi-indexing is indeed recommended for working with multiple dataframes, how exactly would I convert my dataframes into a form I can easily work with?
Would I have one main index, with the next-level being the stocks, and the columns would be contained within each stock?
I went through the docs, which gave many examples using tuples (which I didn't get) or examples using single dataframes.
http://pandas.pydata.org/pandas-docs/stable/advanced.html
So how exactly do I convert my panel into a MultiIndex dataframe?
I'd like to extend @piRSquared's answer with some examples:
In [40]: stocks.to_frame()
Out[40]:
AAPL OPK
Date minor
2010-01-04 Open 2.134300e+02 1.80
High 2.145000e+02 1.97
Low 2.123800e+02 1.76
Close 2.140100e+02 1.95
Volume 1.234324e+08 234500.00
Adj Close 2.772704e+01 1.95
2010-01-05 Open 2.146000e+02 1.64
High 2.155900e+02 1.95
Low 2.132500e+02 1.64
Close 2.143800e+02 1.93
... ... ...
2016-09-06 Low 1.075100e+02 9.19
Close 1.077000e+02 9.36
Volume 2.688040e+07 3026900.00
Adj Close 1.066873e+02 9.36
2016-09-07 Open 1.078300e+02 9.39
High 1.087600e+02 9.60
Low 1.070700e+02 9.38
Close 1.083600e+02 9.59
Volume 4.236430e+07 2632400.00
Adj Close 1.073411e+02 9.59
[10092 rows x 2 columns]
But if you want to convert it to a MultiIndex DF, it's better to leave the original pandas_datareader Panel as it is (i.e., don't swap its axes):
In [38]: p = web.DataReader(stocks_query, data_source='yahoo', start=startDate, end=endDate)
In [39]: p.to_frame()
Out[39]:
Open High Low Close Volume Adj Close
Date minor
2010-01-04 AAPL 213.429998 214.499996 212.380001 214.009998 123432400.0 27.727039
OPK 1.800000 1.970000 1.760000 1.950000 234500.0 1.950000
2010-01-05 AAPL 214.599998 215.589994 213.249994 214.379993 150476200.0 27.774976
OPK 1.640000 1.950000 1.640000 1.930000 135800.0 1.930000
2010-01-06 AAPL 214.379993 215.230000 210.750004 210.969995 138040000.0 27.333178
OPK 1.900000 1.920000 1.770000 1.790000 546600.0 1.790000
2010-01-07 AAPL 211.750000 212.000006 209.050005 210.580000 119282800.0 27.282650
OPK 1.790000 1.940000 1.760000 1.920000 138700.0 1.920000
2010-01-08 AAPL 210.299994 212.000006 209.060005 211.980005 111902700.0 27.464034
OPK 1.920000 1.940000 1.860000 1.890000 62500.0 1.890000
... ... ... ... ... ... ...
2016-08-31 AAPL 105.660004 106.570000 105.639999 106.099998 29662400.0 105.102360
OPK 9.260000 9.260000 9.070000 9.100000 2793300.0 9.100000
2016-09-01 AAPL 106.139999 106.800003 105.620003 106.730003 26701500.0 105.726441
OPK 9.310000 9.540000 9.190000 9.290000 3515300.0 9.290000
2016-09-02 AAPL 107.699997 108.000000 106.820000 107.730003 26802500.0 106.717038
OPK 9.340000 9.390000 9.160000 9.330000 2061200.0 9.330000
2016-09-06 AAPL 107.900002 108.300003 107.510002 107.699997 26880400.0 106.687314
OPK 9.320000 9.480000 9.190000 9.360000 3026900.0 9.360000
2016-09-07 AAPL 107.830002 108.760002 107.070000 108.360001 42364300.0 107.341112
OPK 9.390000 9.600000 9.380000 9.590000 2632400.0 9.590000
[3364 rows x 6 columns]
How to work with a MultiIndex DF:
In [46]: df = p.to_frame()
In [47]: df.loc[pd.IndexSlice[:, ['AAPL']], :]
Out[47]:
Open High Low Close Volume Adj Close
Date minor
2010-01-04 AAPL 213.429998 214.499996 212.380001 214.009998 123432400.0 27.727039
2010-01-05 AAPL 214.599998 215.589994 213.249994 214.379993 150476200.0 27.774976
2010-01-06 AAPL 214.379993 215.230000 210.750004 210.969995 138040000.0 27.333178
2010-01-07 AAPL 211.750000 212.000006 209.050005 210.580000 119282800.0 27.282650
2010-01-08 AAPL 210.299994 212.000006 209.060005 211.980005 111902700.0 27.464034
2010-01-11 AAPL 212.799997 213.000002 208.450005 210.110003 115557400.0 27.221758
2010-01-12 AAPL 209.189995 209.769995 206.419998 207.720001 148614900.0 26.912110
2010-01-13 AAPL 207.870005 210.929995 204.099998 210.650002 151473000.0 27.291720
2010-01-14 AAPL 210.110003 210.459997 209.020004 209.430000 108223500.0 27.133657
2010-01-15 AAPL 210.929995 211.599997 205.869999 205.930000 148516900.0 26.680198
... ... ... ... ... ... ...
2016-08-24 AAPL 108.570000 108.750000 107.680000 108.029999 23675100.0 107.014213
2016-08-25 AAPL 107.389999 107.879997 106.680000 107.570000 25086200.0 106.558539
2016-08-26 AAPL 107.410004 107.949997 106.309998 106.940002 27766300.0 105.934466
2016-08-29 AAPL 106.620003 107.440002 106.290001 106.820000 24970300.0 105.815591
2016-08-30 AAPL 105.800003 106.500000 105.500000 106.000000 24863900.0 105.003302
2016-08-31 AAPL 105.660004 106.570000 105.639999 106.099998 29662400.0 105.102360
2016-09-01 AAPL 106.139999 106.800003 105.620003 106.730003 26701500.0 105.726441
2016-09-02 AAPL 107.699997 108.000000 106.820000 107.730003 26802500.0 106.717038
2016-09-06 AAPL 107.900002 108.300003 107.510002 107.699997 26880400.0 106.687314
2016-09-07 AAPL 107.830002 108.760002 107.070000 108.360001 42364300.0 107.341112
[1682 rows x 6 columns]
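An equivalent way to pull a single symbol out of that frame is .xs on the 'minor' level (a short sketch against the df = p.to_frame() frame above):

# keep the (Date, minor) MultiIndex
aapl = df.xs('AAPL', level='minor', drop_level=False)

# or drop the symbol level and get a plain Date-indexed frame
aapl_flat = df.xs('AAPL', level='minor')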
You're going to love this one:
stocks.to_frame()
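To reproduce the asker's log_return column on the MultiIndex frame without looping over panel items, a per-symbol groupby works. A sketch, assuming the long-format df = p.to_frame() from the earlier answer (index levels Date and minor):

import numpy as np

df = p.to_frame()
# shift within each symbol so the returns never mix AAPL and OPK rows
df['log_return'] = (df.groupby(level='minor')['Close']
                      .transform(lambda s: np.log(s / s.shift(1))))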
I have a logging table collecting values from many probes:
CREATE TABLE [Log]
(
[LogID] int IDENTITY (1, 1) NOT NULL,
[Minute] datetime NOT NULL,
[ProbeID] int NOT NULL DEFAULT 0,
[Value] FLOAT(24) NOT NULL DEFAULT 0.0,
CONSTRAINT Log_PK PRIMARY KEY([LogID])
)
GO
CREATE INDEX [Minute_ProbeID_Value] ON [Log]([Minute], [ProbeID], [Value])
GO
Typically, each probe generates a value every minute or so. Some example output:
LogID Minute ProbeID Value
====== ================ ======= =====
873875 2014-07-27 09:36 1972 24.4
873876 2014-07-27 09:36 2001 29.7
873877 2014-07-27 09:36 3781 19.8
873878 2014-07-27 09:36 1963 25.6
873879 2014-07-27 09:36 2002 22.9
873880 2014-07-27 09:36 1959 -30.1
873881 2014-07-27 09:36 2005 20.7
873882 2014-07-27 09:36 1234 23.8
873883 2014-07-27 09:36 1970 19.9
873884 2014-07-27 09:36 1991 22.4
873885 2014-07-27 09:37 1958 1.7
873886 2014-07-27 09:37 1962 21.3
873887 2014-07-27 09:37 1020 23.1
873888 2014-07-27 09:38 1972 24.1
873889 2014-07-27 09:38 3781 20.1
873890 2014-07-27 09:38 2001 30
873891 2014-07-27 09:38 2002 23.4
873892 2014-07-27 09:38 1963 26
873893 2014-07-27 09:38 2005 20.8
873894 2014-07-27 09:38 1234 23.7
873895 2014-07-27 09:38 1970 19.8
873896 2014-07-27 09:38 1991 22.7
873897 2014-07-27 09:39 1958 1.4
873898 2014-07-27 09:39 1962 22.1
873899 2014-07-27 09:39 1020 23.1
What is the most efficient way to get just the latest reading for each Probe?
Example of desired output (note: the "Value" is not e.g. a Max() or an Avg()):
LogID Minute ProbeID Value
====== ================= ======= =====
873899 27-Jul-2014 09:39 1020 23.1
873894 27-Jul-2014 09:38 1234 23.7
873897 27-Jul-2014 09:39 1958 1.4
873880 27-Jul-2014 09:36 1959 -30.1
873898 27-Jul-2014 09:39 1962 22.1
873892 27-Jul-2014 09:38 1963 26
873895 27-Jul-2014 09:38 1970 19.8
873888 27-Jul-2014 09:38 1972 24.1
873896 27-Jul-2014 09:38 1991 22.7
873890 27-Jul-2014 09:38 2001 30
873891 27-Jul-2014 09:38 2002 23.4
873893 27-Jul-2014 09:38 2005 20.8
873889 27-Jul-2014 09:38 3781 20.1
This is another approach:
select *
from log l
where minute =
(select max(x.minute) from log x where x.probeid = l.probeid)
You can compare the execution plan w/ a fiddle - http://sqlfiddle.com/#!3/1d3ff/3/0
Try this:
SELECT T1.*
FROM Log T1
INNER JOIN (SELECT MAX(Minute) AS Minute,
                   ProbeID
            FROM Log
            GROUP BY ProbeID) T2
    ON T1.ProbeID = T2.ProbeID
   AND T1.Minute = T2.Minute
You can play around with it on SQL Fiddle
Your question is: "What is the most efficient way to get just the latest reading for each Probe?"
To really answer this question, you need to test different solutions. I would generally go with the row_number() method suggested by @jyparask. However, the following might have better performance:
select l.*
from log l
where not exists (select 1
from log l2
where l2.probeid = l.probeid and
l2.minute > l.minute
);
For performance, you want an index on log(probeid, minute).
Although not exactly your problem, here is an example of where not exists performs better than other methods on SQL Server.
The row_number() approach looks like this:
;WITH MyCTE AS
(
SELECT LogID,
Minute,
ProbeID,
Value,
ROW_NUMBER() OVER(PARTITION BY ProbeID ORDER BY Minute DESC) AS rn
FROM LOG
)
SELECT LogID,
Minute,
ProbeID,
Value
FROM MyCTE
WHERE rn = 1