Separating a column's data into multiple columns in Hive

I have sample data for a device which contains two controllers and their versions. The sample data is as follows:
device_id controller_id versions
123 1 0.1
123 2 0.15
456 2 0.25
143 1 0.35
143 2 0.36
The above data should be transformed into the format below:
device_id 1st_ctrl_id_ver 2nd_ctrl_id_ver
123 0.1 0.15
456 NULL 0.25
143 0.35 0.36
I used the code below, which is not working as intended:
select
device_id,
case when controller_id="1" then versions end as 1st_ctrl_id_ver,
case when controller_id="2" then versions end as 2nd_ctrl_id_ver
from device_versions
The output I got is:
device_id 1st_ctrl_id_ver 2nd_ctrl_id_ver
123 0.1 NULL
123 NULL 0.15
456 NULL 0.25
143 0.35 NULL
143 NULL 0.36
I don't want the NULL values in each row. Can someone help me write the correct code?

To "fold" all lines with a given key into a single line, you have to run an aggregation, even if you don't really aggregate values in practice.
Something like:
select device_id,
MAX(case when controller_id="1" then versions end) as 1st_ctrl_id_ver,
MAX(case when controller_id="2" then versions end) as 2nd_ctrl_id_ver
from device_versions
GROUP BY device_id
But be aware that this code will work if and only if you have at most one entry per controller per device, and any controller_id other than 1 or 2 will be ignored. In other words, it is rather brittle (but you can't do much better in plain SQL anyway).
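If you are not sure whether that assumption holds, a quick sanity check along these lines (a sketch against the same device_versions table from the question) will list any device/controller pairs with more than one version row, which the MAX() above would otherwise collapse silently:
-- any row returned here means MAX() is picking one of several versions
select device_id, controller_id, count(*) as version_rows
from device_versions
GROUP BY device_id, controller_id
HAVING count(*) > 1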

Using case statement to do binning in sql

We have a small table with the following structure:
act_id | ratio
123 | 0
234 | 0.001
235 | 0.05
I am trying to use a CASE statement to add a new column which does bucketing, something like:
SELECT act_id,
       ratio,
       CASE WHEN 0 < ratio <= 0.04 THEN '(0,0.04]'
            WHEN ratio > 0.04 THEN ratio END AS new_col
FROM table
But it gives the following error: SQL compilation error: error line 50 at position 56 invalid identifier '"(0,0.04]"'. The desired output is:
act_id | ratio | new_col
123    | 0     | 0
234    | 0.001 | (0,0.04]
235    | 0.05  | 0.05
May I know how we can use the CASE statement here to put the desired strings in this new_col, using open or closed intervals? Help appreciated.
A case expression returns a single value, so all branches need to return a single type (strings in this case). Otherwise, if one branch returns a number and another a string, the string is converted to a number, which is what produces your error. Cast the number to a string instead:
SELECT act_id,
ratio,
(CASE WHEN 0 < ratio <= 0.04 THEN '(0,0.04]'
WHEN ratio > 0.04 THEN ratio::string
END) AS new_col
FROM table
I'm not sure if 0 < ratio <= 0.04 does what you expect. I would recommend:
SELECT act_id,
ratio,
(CASE WHEN ratio > 0 and ratio <= 0.04 THEN '(0,0.04]'
WHEN ratio > 0.04 THEN ratio::string
END) AS new_col
FROM table

Combining aggregate and analytics functions in BigQuery to reduce table size

I am facing an issue that is more about query design than SQL specifics. Essentially, I am trying to reduce the size of a dataset by carrying out the transformations described below.
I start with an irregularly sampled time series of voltage measurements:
Seconds  Volts
0.1      2899
0.15     2999
0.17     2990
0.6      3001
0.98     2978
1.2      3000
1.22     3003
3.7      2888
4.1      2900
4.11     3012
4.7      3000
4.8      3000
I bin the data into buckets, where data points that are close to one another fall into the same bucket. In this example, I bin data into 1 second buckets simply by dividing the Seconds column by 1. I also add an ordering number to each group. I use the below query:
WITH qry1 AS (
  SELECT
    Seconds
    , Volts
    , DIV(CAST(Seconds AS NUMERIC), 1) AS BinNo
    , RANK() OVER (PARTITION BY DIV(CAST(Seconds AS NUMERIC), 1) ORDER BY Seconds) AS BinRank
  FROM
    project.rawdata
)
SELECT * FROM qry1
This gives:
Seconds  Volts  BinNo  BinRank
0.1      2899   0      1
0.15     2999   0      2
0.17     2990   0      3
0.6      3001   0      4
0.98     2978   0      5
1.2      3000   1      1
1.22     3003   1      2
3.7      2888   3      1
4.1      2900   4      1
4.11     3012   4      2
4.7      3000   4      3
4.8      3000   4      4
Now comes the part I am struggling with. I am attempting to get the following output from a query acting on the above table. Keeping the time order is important as I need to plot these values on a line style chart. For each group:
Get the first row ('first' meaning earliest Second value)
Get the Max and Min of the Volts field, and associate these with the earliest (can be latest too I guess) Seconds value
Get the last row (last meaning latest Second value)
The conditions for this query are:
If there is only one row in the group, simply assign the Volts value for that row as both the max and the min and only use the single Seconds value for that group
If there are only two rows in the group, simply assign the Volts values for both the max and min to the corresponding first and last Seconds values, respectively.
(Now for the part I am struggling with) If there are three rows or more per group, extract the first and last rows as above, but then also get the max and min over all rows in the group and assign these to the max and min values for an intermediate row between the first and last row. The output would be as below. As mentioned, this step could be associated with any position between the first and last Seconds values, and here I have assigned it to the first Seconds value per group.
Seconds  Volts_min  Volts_max  OrderingCol
0.1      2899       2899       1
0.1      2899       3001       2
0.98     2978       2978       3
1.2      3000       3000       1
1.22     3003       3003       2
3.7      2888       2888       1
4.1      2900       2900       1
4.1      2900       3012       2
4.8      3000       3000       3
This will then allow me to plot these values using a custom charting library which we have without overloading the memory. I can extract the first and last rows per group by using analytics functions and then doing a join, but cannot get the intermediate values. The Ordering Column's goal is to enable me to sort the table before pulling the data to the dashboard. I am attempting to do this in BigQuery as a first preference.
Thanks :)
Below should do it
select Seconds, values.*,
  row_number() over(partition by bin_num order by Seconds) as OrderingCol
from (
  select *,
    case
      when row_num = 1 or row_num = rows_count then true
      when rows_count > 2 and row_num = 2 then true
    end toShow,
    case
      when row_num = 1 then struct(first_row.Volts as Volts_min, first_row.Volts as Volts_max)
      when row_num = rows_count then struct(last_row.Volts as Volts_min, last_row.Volts as Volts_max)
      else struct(min_val as Volts_min, max_val as Volts_max)
    end values
  from (
    select *,
      div(cast(Seconds AS numeric), 1) as bin_num,
      row_number() over win_all as row_num,
      count(1) over win_all as rows_count,
      min(Volts) over win_all as min_val,
      max(Volts) over win_all as max_val,
      first_value(t) over win_with_order as first_row,
      last_value(t) over win_with_order as last_row
    from `project.dataset.table` t
    window
      win_all as (partition by div(cast(Seconds AS numeric), 1)),
      win_with_order as (partition by div(cast(Seconds AS numeric), 1) order by Seconds)
  )
)
where toShow
# order by Seconds
If applied to the sample data in your question, it produces the expected output.

Select last row from each column of multi-index Pandas DataFrame based on time, when columns are unequal length

I have the following Pandas multi-index DataFrame with the top level index being a group ID and the second level index being when, in ISO 8601 time format (shown here without the time):
value weight
when
5e33c4bb4265514aab106a1a 2011-05-12 1.34 0.79
2011-05-07 1.22 0.83
2011-05-03 2.94 0.25
2011-04-28 1.78 0.89
2011-04-22 1.35 0.92
... ... ...
5e33c514392b77d517961f06 2009-01-31 30.75 0.12
2009-01-24 30.50 0.21
2009-01-23 29.50 0.96
2009-01-10 28.50 0.98
2008-12-08 28.50 0.65
when is currently defined as an index but this is not a requirement.
Assertions
when may be non-unique.
Columns may be of unequal length across groups
Within groups, when, value and weight will always be of equal length (for each when there will always be a value and a weight).
Question
Using the parameter index_time, how do you retrieve:
The most recent past value and weight from each group relative to index_time along with the difference (in seconds) between index_time and when.
index_time may be a time in the past such that only entries where when <= index_time are selected.
The result should be indexed in some way so that the group id of each result can be deduced
Example
From the above, if the index_time was 2011-05-10 then the result should be:
value weight age
5e33c4bb4265514aab106a1a 1.22 0.83 259200
5e33c514392b77d517961f06 30.75 0.12 72576000
Where the original DataFrame given in the question is df:
import pandas as pd

when = pd.Timestamp('2011-05-10')  # the example index_time
df.sort_index(inplace=True)
# keep rows with the `when` level <= index_time, then take the latest row per group
result = df.loc[pd.IndexSlice[:, :when], :].groupby(level=0).tail(1)
result['age'] = (when - result.index.get_level_values(level=1)).total_seconds()

Select every nth row as a Pandas DataFrame without reading the entire file

I am reading a large file that contains ~9.5 million rows x 16 cols.
I am interested in retrieving a representative sample, and since the data is organized by time, I want to do this by selecting every 500th element.
I am able to load the data, and then select every 500th row.
My question: Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read it all first and then filter my data?
Question 2: How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
Here is a snippet of what the data looks like (first five rows). The first 4 rows are out of order, but the remaining dataset looks ordered (by time):
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 1 2017-01-09 11:13:28 2017-01-09 11:25:45 1 3.30 1 N 263 161 1 12.5 0.0 0.5 2.00 0.00 0.3 15.30
1 1 2017-01-09 11:32:27 2017-01-09 11:36:01 1 0.90 1 N 186 234 1 5.0 0.0 0.5 1.45 0.00 0.3 7.25
2 1 2017-01-09 11:38:20 2017-01-09 11:42:05 1 1.10 1 N 164 161 1 5.5 0.0 0.5 1.00 0.00 0.3 7.30
3 1 2017-01-09 11:52:13 2017-01-09 11:57:36 1 1.10 1 N 236 75 1 6.0 0.0 0.5 1.70 0.00 0.3 8.50
4 2 2017-01-01 00:00:00 2017-01-01 00:00:00 1 0.02 2 N 249 234 2 52.0 0.0 0.5 0.00 0.00 0.3 52.80
Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read it all first and then filter my data?
Something you could do is to use the skiprows parameter in read_csv, which accepts a list-like argument of row indices to discard (and thus, indirectly, to select). So you could create a np.arange with a length equal to the number of rows in the file and remove every 500th element from it using np.delete, so that only every 500th row is read:
import numpy as np
import pandas as pd

n_rows = int(9.5e6)  # approximate number of rows in the file
skip = np.arange(n_rows)
skip = np.delete(skip, np.arange(0, n_rows, 500))  # un-skip the header row and every 500th row
df = pd.read_csv('my_file.csv', skiprows=skip)
Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read it all first and then filter my data?
First get the length of the file with a custom function, remove every 500th row with numpy.setdiff1d, and pass the result to the skiprows parameter in read_csv:
# https://stackoverflow.com/q/845058
import numpy as np
import pandas as pd

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

len_of_file = file_len('test.csv')
print(len_of_file)

skipped = np.setdiff1d(np.arange(len_of_file), np.arange(0, len_of_file, 500))
print(skipped)

df = pd.read_csv('test.csv', skiprows=skipped)
How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
The idea is to read only the datetime column with the usecols parameter, then sort it and select every 500th index value, take the set difference with all row indices, and pass that to skiprows again:
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

len_of_file = file_len('test.csv')

df1 = pd.read_csv('test.csv',
                  usecols=['tpep_pickup_datetime'],
                  parse_dates=['tpep_pickup_datetime'])

sorted_idx = (df1['tpep_pickup_datetime'].sort_values()
                 .iloc[np.arange(0, len_of_file, 500)].index)

skipped = np.setdiff1d(np.arange(len_of_file), sorted_idx)
print(skipped)

df = pd.read_csv('test.csv', skiprows=skipped).sort_values(by=['tpep_pickup_datetime'])
Use a lambda with skiprows:
pd.read_csv(path, skiprows=lambda i: i % N)
This keeps only every Nth row (the callable skips any row whose index is not a multiple of N, so row 0, the header, is always kept).
source: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
You can use the csv module, which returns an iterator, together with itertools.cycle to select every nth row.
import csv
from itertools import cycle

source_file = 'D:/a.txt'
cycle_size = 500

# True once every cycle_size rows (rows 0, 500, 1000, ...)
chooser = (x == 0 for x in cycle(range(cycle_size)))

with open(source_file) as f1:
    rdr = csv.reader(f1)
    data = [row for pick, row in zip(chooser, rdr) if pick]

SQL linear interpolation based on lookup table

I need to build linear interpolation into an SQL query, using a joined table containing lookup values (more like lookup thresholds, in fact). As I am relatively new to SQL scripting, I have searched for example code to point me in the right direction, but most of the SQL scripts I came across were for interpolating between dates and timestamps, and I couldn't relate these to my situation.
Basically, I have a main data table with many rows of decimal values in a single column, for example:
Main_Value
0.33
0.12
0.56
0.42
0.1
Now, I need to yield interpolated data points for each of the rows above, based on a joined lookup table with 6 rows, containing non-linear threshold values and the associated linear normalized values:
Threshold_Level Normalized_Value
0 0
0.15 20
0.45 40
0.60 60
0.85 80
1 100
So for example, if the value in the Main_Value column is 0.45, the query will lookup its position in (or between) the nearest Threshold_Level, and interpolate this based on the adjacent value in the Normalized_Value column (which would yield a value of 40 in this example).
I really would be grateful for any insight into building a SQL query around this, especially as it has been hard to track down any SQL examples of linear interpolation using a joined table.
It has been pointed out that I could use some sort of rounding, so I have included a more detailed table below. I would like the SQL query to look up each Main_Value (from the first table above) where it falls between the Threshold_Min and Threshold_Max values in the table below, and return the 'Normalized_%' value:
Threshold_Min Threshold_Max Normalized_%
0.00 0.15 0
0.15 0.18 5
0.18 0.22 10
0.22 0.25 15
0.25 0.28 20
0.28 0.32 25
0.32 0.35 30
0.35 0.38 35
0.38 0.42 40
0.42 0.45 45
0.45 0.60 50
0.60 0.63 55
0.63 0.66 60
0.66 0.68 65
0.68 0.71 70
0.71 0.74 75
0.74 0.77 80
0.77 0.79 85
0.79 0.82 90
0.82 0.85 95
0.85 1.00 100
For example, if the value from the Main_Value table is 0.52, it falls between Threshold_Min 0.45 and Threshold_Max 0.60, so the Normalized_% returned is 50%. The problem is that the Threshold_Min and Max values are not linear. Could anyone point me in the direction of how to script this?
Assuming you want, for each Main_Value, the Normalized_Value of the nearest lower (or equal) Threshold_Level, you can do it like this:
select t1.Main_Value, max(t2.Normalized_Value) as Normalized_Value
from #t1 t1
inner join #t2 t2 on t1.Main_Value >= t2.Threshold_Level
group by t1.Main_Value
Replace #t1 and #t2 with the correct table names.
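If you also need the range-based lookup from the edited table, or actual linear interpolation between thresholds, the sketches below may help. They assume the edited table is called #t3 with a column named Normalized_Pct (Normalized_% is not a valid identifier), that your dialect supports LEAD(), and that you decide which end of each interval is inclusive.
-- Range lookup against the edited Threshold_Min / Threshold_Max table
-- (lower bound exclusive, upper bound inclusive here, so a Main_Value
-- of exactly 0 would need special handling)
select t1.Main_Value, t3.Normalized_Pct
from #t1 t1
inner join #t3 t3
  on t1.Main_Value > t3.Threshold_Min
 and t1.Main_Value <= t3.Threshold_Max;

-- True linear interpolation between adjacent rows of the original
-- Threshold_Level / Normalized_Value table (#t2)
with steps as (
  select Threshold_Level,
         Normalized_Value,
         lead(Threshold_Level) over (order by Threshold_Level) as next_level,
         lead(Normalized_Value) over (order by Threshold_Level) as next_value
  from #t2
)
select t1.Main_Value,
       s.Normalized_Value
         + (t1.Main_Value - s.Threshold_Level)
           / (s.next_level - s.Threshold_Level)
           * (s.next_value - s.Normalized_Value) as Interpolated_Value
from #t1 t1
inner join steps s
  on t1.Main_Value >= s.Threshold_Level
 and t1.Main_Value < s.next_level;
-- a Main_Value of exactly 1 falls outside the last half-open segment
-- and would need to be handled separately
With the sample data above, the second query returns 40 for a Main_Value of 0.45 (matching your example) and 32 for 0.33.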