I've designed a table with the structure below:
Column | Type | Collation | Nullable | Default
-------------------------+--------------------------+-----------+----------+---------
counts | integer[] | | |
Every value in the counts column has 9 elements, and none of the elements is NULL.
I would like to get the per-row sums of the transposed array (i.e. column-wise sums), as in the Python code below, using only a SQL query.
import numpy as np

counts = np.array([
    [ 0,  1,  2,  3],
    [ 4,  5,  6,  7],
    [ 8,  9, 10, 11]
])
counts = counts.transpose()
sums = list(map(lambda x: sum(x), counts))  # per-row sums of the transpose = column sums
print(sums)  # [12, 15, 18, 21]
So given example data:
counts |
-----------------------|
{0,15,8,6,10,12,4,0,5} |
{0,4,6,14,7,9,8,0,9} |
{0,6,7,4,11,6,10,0,10} |
Because the table has over a thousand records, it takes around 500 ms to compute the transposed sums on the frontend side.
I would like a result:
counts |
---------------------------|
{0,25,21,24,28,27,22,0,24} |
or
value0 | value1 | value2 | value3 | value4 | value5 | value6 | value7 | value8 |
--------|--------|--------|--------|--------|--------|--------|--------|--------|
0 | 25 | 21 | 24 | 28 | 27 | 22 | 0 | 24 |
I believe this could be solved with SQL aggregate functions, but I don't know how to use them.
Just SUM() the array elements:
SELECT SUM(i[1]) -- first
, SUM(i[2]) -- second
, SUM(i[3]) -- third
-- etc.
, SUM(i[9])
FROM (VALUES
('{0,15,8,6,10,12,4,0,5}'::int[]),
('{0,4,6,14,7,9,8,0,9}'),
('{0,6,7,4,11,6,10,0,10}')) s(i);
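If you'd rather not spell out all nine positions, here is a sketch that generalizes to any array length by unnesting with ordinality, summing per position, and re-aggregating into a single array (it assumes your table is named my_table; adjust to your actual table name):

SELECT array_agg(s ORDER BY ord) AS counts
FROM (
  SELECT u.ord, SUM(u.elem) AS s
  FROM my_table, unnest(counts) WITH ORDINALITY AS u(elem, ord)
  GROUP BY u.ord
) t;

Either way, running the aggregation in the database should remove the 500 ms frontend computation entirely.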
My table column has nested arrays in a Snowflake database. I want to perform some aggregations using SQL (Snowflake SQL).
My table name is: DATA
The PROJ column is of the VARIANT data type. The nested arrays will not always have three elements, as the DATA table demonstrates.
| ID | PROJ | LOCATION |
|----|-------------------------------|----------|
| 1 |[[0, 4], [1, 30], [10, 20]] | S |
| 2 |[[0, 2], [1, 20]] | S |
| 3 |[[0, 8], [1, 10], [10, 100]] | S |
Desired Output:
| Index | LOCATION | Min | Max | Mean|
|-------|----------|------|-----|-----|
| 0 | S | 2 | 8 | 4.66|
| 1 | S | 10 | 30 | 20 |
| 10 | S | 20 | 100| 60 |
First the nested array should be flattened; then Index is the first element of each subarray and Value is the second element (arrays are 0-based):
CREATE OR REPLACE TABLE DATA
AS
SELECT 1 AS ID, [[0, 4], [1, 30], [10, 20]] AS PROJ UNION
SELECT 2 AS ID, [[0, 2], [1, 20]] AS PROJ UNION
SELECT 3 AS ID, [[0, 8], [1, 10], [10, 100]] AS PROJ;
Query:
SELECT s.VALUE[0]::INT AS Index,
MIN(s.VALUE[1]::INT) AS MinValue,
MAX(s.VALUE[1]::INT) AS MaxValue,
AVG(s.VALUE[1]::INT) AS MeanValue
FROM DATA
,LATERAL FLATTEN(input=> PROJ) s
GROUP BY s.VALUE[0]::INT
ORDER BY Index;
Output: the query returns the Index, MinValue, MaxValue and MeanValue rows from the desired table above (with the mean for index 0 as 4.666667 rather than the rounded 4.66).
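The CREATE TABLE above omits the LOCATION column for brevity. If you want it in the output, as in your desired result, add it to the SELECT list and the GROUP BY; a sketch assuming DATA also has a LOCATION column:

SELECT s.VALUE[0]::INT AS Index,
       LOCATION,
       MIN(s.VALUE[1]::INT) AS MinValue,
       MAX(s.VALUE[1]::INT) AS MaxValue,
       AVG(s.VALUE[1]::INT) AS MeanValue
FROM DATA
    ,LATERAL FLATTEN(input=> PROJ) s
GROUP BY s.VALUE[0]::INT, LOCATION
ORDER BY Index;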
Below is my data table, from my code's output:
| columnA|ColumnB|ColumnC|
| ------ | ----- | ------|
| 12 | 8 | 1.34 |
| 8 | 12 | 1.34 |
| 1 | 7 | 0.25 |
I want to dedupe it so that only these rows are left:
| columnA|ColumnB|ColumnC|
| ------ | ----- | ------|
| 12 | 8 | 1.34 |
| 1 | 7 | 0.25 |
Usually when I drop duplicates, I use .drop_duplicates(subset=...). But this time I want to drop rows that form the same pair, e.g. (columnA, ColumnB) == (ColumnB, columnA). In my research I found that someone uses set((a,b) if a<=b else (b,a) for a,b in pairs) to remove duplicate pairs from a list, but I don't know how to apply this method to my pandas DataFrame. Please help, and thank you in advance!
Convert relevant columns to frozenset:
out = df[~df[['columnA', 'ColumnB']].apply(frozenset, axis=1).duplicated()]
print(out)
# Output
columnA ColumnB ColumnC
0 12 8 1.34
2 1 7 0.25
Details:
>>> set([8, 12])
{8, 12}
>>> set([12, 8])
{8, 12}
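To see the intermediate step, here is a minimal runnable sketch (it reconstructs the DataFrame from the question): each row's (columnA, ColumnB) pair becomes an order-insensitive frozenset, so {12, 8} and {8, 12} compare equal and the second occurrence is flagged as a duplicate.

import pandas as pd

df = pd.DataFrame({
    "columnA": [12, 8, 1],
    "ColumnB": [8, 12, 7],
    "ColumnC": [1.34, 1.34, 0.25],
})

# Build an order-insensitive key per row, then flag repeated keys
keys = df[["columnA", "ColumnB"]].apply(frozenset, axis=1)
print(keys.duplicated().tolist())  # [False, True, False]

frozenset is used rather than set because duplicated() needs hashable values.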
You can combine columnA and ColumnB into an order-insensitive tuple and call drop_duplicates on the combined column:
t = df[["columnA", "ColumnB"]].apply(lambda row: tuple(set(row)), axis=1)
df.assign(t=t).drop_duplicates("t").drop(columns="t")
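Note that tuple(set(row)) relies on equal sets iterating in the same order, which usually holds but is not guaranteed by Python when hash collisions occur; tuple(sorted(row)) makes the ordering explicit and is safer.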
A possible solution is the following:
# pip install pandas
import pandas as pd

# create a test dataframe matching the question
df = pd.DataFrame({"colA": [12, 8, 1], "colB": [8, 12, 7], "colC": [1.34, 1.34, 0.25]})

# wherever colA > colB, swap the pair so every row is ordered (colA <= colB),
# which turns mirrored pairs into literal duplicates
df.loc[df.colA > df.colB, df.columns] = df.loc[df.colA > df.colB, df.columns[[1, 0, 2]]].values

# now a plain drop_duplicates removes the mirrored rows
df.drop_duplicates()
Returns

   colA  colB  colC
0     8    12  1.34
2     1     7  0.25

(Note that the swap leaves each surviving pair ordered as colA <= colB.)
I am using an AWS S3 stage to load .csv data into my Snowflake database.
The .csv columns are as follows:
My COPY INTO command is this:
copy into MY_TABLE(tot_completions, tot_hov, parent_id)
from (select t.$1, to_decimal(REPLACE(t.$2, ',')), 1 from @my_stage t)
pattern='.*file_path.*' file_format = my_file_format ON_ERROR=CONTINUE;
The Tot. HOV values are being automatically rounded (to 40 and 1, respectively). The column's data type is decimal, and I tried float as well; both should be able to store decimal fractions.
My desired result is to store the decimal as is displayed on the .csv without rounding. Any help would be greatly appreciated.
You need to specify the precision and scale:
create or replace table number_conv(expr varchar);
insert into number_conv values ('12.3456'), ('98.76546');
select expr, to_number(expr), to_number(expr, 10, 1), to_number(expr, 10, 8) from number_conv;
+----------+-----------------+------------------------+------------------------+
| EXPR | TO_NUMBER(EXPR) | TO_NUMBER(EXPR, 10, 1) | TO_NUMBER(EXPR, 10, 8) |
|----------+-----------------+------------------------+------------------------|
| 12.3456 | 12 | 12.3 | 12.34560000 |
| 98.76546 | 99 | 98.8 | 98.76546000 |
+----------+-----------------+------------------------+------------------------+
and:
select column1,
to_decimal(column1, '99.9') as d0,
to_decimal(column1, '99.9', 9, 5) as d5,
to_decimal(column1, 'TM9', 9, 5) as td5
from values ('1.0'), ('-12.3'), ('0.0'), (' - 0.1 ');
+---------+-----+-----------+-----------+
| COLUMN1 | D0 | D5 | TD5 |
|---------+-----+-----------+-----------|
| 1.0 | 1 | 1.00000 | 1.00000 |
| -12.3 | -12 | -12.30000 | -12.30000 |
| 0.0 | 0 | 0.00000 | 0.00000 |
| - 0.1 | 0 | -0.10000 | -0.10000 |
+---------+-----+-----------+-----------+
See the Snowflake documentation for TO_NUMBER / TO_DECIMAL for more.
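Applied to the COPY INTO from the question, that means passing an explicit precision and scale to to_decimal. A sketch, assuming two decimal places are enough (adjust the scale to your data, and make sure the target column is declared with a matching scale, e.g. NUMBER(38, 2)):

copy into MY_TABLE(tot_completions, tot_hov, parent_id)
from (select t.$1, to_decimal(REPLACE(t.$2, ','), 38, 2), 1 from @my_stage t)
pattern='.*file_path.*' file_format = my_file_format ON_ERROR=CONTINUE;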
I have a SQL table of the format (INTEGER, json_array(INTEGER)).
I need to return results from the table with two boolean columns: one is true iff a 1 appears in the JSON array, and the other is true iff a 2 appears in the array. Obviously the two are not mutually exclusive.
For example, if the data were this:
-------------------------------
| ID | VALUES                 |
-------------------------------
| 12 | [1, 4, 6, 11]          |
-------------------------------
| 74 | [0, 1, 2, 5]           |
-------------------------------
I would hope to get back:
-------------------------------
| ID | HAS1 | HAS2            |
-------------------------------
| 12 | true | false           |
-------------------------------
| 74 | true | true            |
-------------------------------
I have managed to extract the json data out of the values column using json_each, but am unsure how to proceed.
If I recall correctly, SQLite's max aggregate function works on boolean values (which SQLite stores as the integers 0 and 1), so you can simply group your data:
select
    t1.id,
    max(case json_each.value when 1 then true else false end) as has1,
    max(case json_each.value when 2 then true else false end) as has2
from
    yourtable t1,
    json_each(t1."values")
group by
    t1.id

(The column is quoted as "values" because VALUES is a reserved keyword in SQLite.)
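For a quick check with inline data, a sketch (table and column names taken from the question; since SQLite booleans are just 0/1, the max of the comparison results acts as a logical OR per group):

with yourtable(id, "values") as (
    values (12, '[1, 4, 6, 11]'),
           (74, '[0, 1, 2, 5]')
)
select
    t1.id,
    max(json_each.value = 1) as has1,
    max(json_each.value = 2) as has2
from
    yourtable t1,
    json_each(t1."values")
group by
    t1.id;

which returns 12|1|0 and 74|1|1.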
I am using xgboost with objective='binary:logistic' to calculate, for each customer, the probability that he/she will make a spend.
Using predict_proba in sklearn prints two probabilities, for class 0 and class 1, like:
[[0.56651809 0.43348191]
[0.15598162 0.84401838]
[0.86852502 0.13147498]]
How can I attach each customer's ID using pandas to get something like:
+----+------------+------------+
| ID | prob_0 | prob_1 |
+----+------------+------------+
| 1 | 0.56651809 | 0.43348191 |
| 2 | 0.15598162 | 0.84401838 |
| 3 | 0.86852502 | 0.13147498 |
+----+------------+------------+
You can use the pandas DataFrame() constructor to build that form:
import pandas as pd

list_data = [[0.56651809, 0.43348191], [0.15598162, 0.84401838], [0.86852502, 0.13147498]]
columns = ['prob_0', 'prob_1']
index = [1, 2, 3]  # customer IDs
pd.DataFrame(data=list_data, columns=columns, index=index)
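To wire this to the model's output directly, a sketch (clf, X and customer_ids are assumed names for your fitted classifier, feature matrix, and the customer IDs in the same row order as X):

import pandas as pd

proba = clf.predict_proba(X)  # shape (n_samples, 2): columns are P(class 0), P(class 1)
result = pd.DataFrame(proba, columns=['prob_0', 'prob_1'], index=customer_ids)
result.index.name = 'ID'
print(result)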