How to flatten a JSON in Snowflake using SQL?

I have a table "table_1" with one column called "Value", and it only has one entry. The entry in the column is a JSON that looks like this:
{
"c1": "A",
"c10": "B",
"c100": "C",
"c101": "D",
"c102": "E",
"c103": "F",
"c104": "G",
.......
}
I would like to split this JSON into two columns, where one column contains the keys (c1, c10, etc.) and the second column contains the associated value for each key (A, B, etc.). Is there a way I can do this? There are about 125 keys in my JSON.

It is possible to achieve this using the FLATTEN function:
CREATE OR REPLACE TABLE tab
AS
SELECT PARSE_JSON('{
"c1": "A",
"c10": "B",
"c100": "C",
"c101": "D",
"c102": "E",
"c103": "F",
"c104": "G",
}') AS col;
SELECT KEY, VALUE::TEXT AS value
FROM tab,
     TABLE(FLATTEN(INPUT => tab.COL));
Output:

KEY  | VALUE
-----+------
c1   | A
c10  | B
c100 | C
c101 | D
c102 | E
c103 | F
c104 | G
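The same pattern works directly against the original table. A minimal sketch, assuming the "Value" column stores the JSON as text (if it is already a VARIANT, drop the PARSE_JSON call):

SELECT f.KEY, f.VALUE::TEXT AS value
FROM table_1,
     LATERAL FLATTEN(INPUT => PARSE_JSON(table_1."Value")) f;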

Related

t-test several subgroups using groupby

Let's say I have this dataframe:
data = {"col1":["yes", "no", "yes", "no", "yes", "no", "yes", "no"],\
"col2":["A", "A", "B", "B", "C", "C", "D", "D"],\
"col3":[24, 20, 19, 17, 24, 27, 22, 18]}
df = pd.DataFrame(data=data)
Now, if I want to t-test the col3 values between col1 "yes" and "no", I can use the following:
pg.ttest(*df.groupby("col1")["col3"].apply(lambda x:x.values))
But let's say I want to compare based on grouping by both col1 and col2, e.g. comparing "A"/"yes" vs "A"/"no", "B"/"yes" vs "B"/"no", etc. I know I can group by two columns, but every attempt I made to fit the two-column groupby into a t-test has failed.
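One way to structure this (a sketch, not a confirmed solution from the thread): group by col2 alone, then split each subgroup on col1 and run one pg.ttest per subgroup:

import pandas as pd
import pingouin as pg

df = pd.DataFrame({"col1": ["yes", "no", "yes", "no", "yes", "no", "yes", "no"],
                   "col2": ["A", "A", "B", "B", "C", "C", "D", "D"],
                   "col3": [24, 20, 19, 17, 24, 27, 22, 18]})
# the toy data has only one row per (col1, col2) cell, which makes a t-test
# degenerate; duplicate the rows with a small offset just for illustration
df = pd.concat([df, df.assign(col3=df["col3"] + 1)], ignore_index=True)

# one t-test per col2 group: col3 where col1 == "yes" vs col1 == "no"
results = {
    group: pg.ttest(sub.loc[sub["col1"] == "yes", "col3"],
                    sub.loc[sub["col1"] == "no", "col3"])
    for group, sub in df.groupby("col2")
}
for group, res in results.items():
    print(group, float(res["p-val"].iloc[0]))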

Trino or Presto: cast '["b", "a", "b", "c"]' into a values array

Is there a good way to cast a JSON-typed array '["b", "a", "b", "c"]' into an array?
e.g. something that looks like this
select array['b', 'a', 'c', 'b'];
I found it in the documentation:
select CAST(JSON '["b", "a", "b", "c"]' AS ARRAY(varchar));
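If the array arrives in a varchar column rather than as a JSON literal, a json_parse call is needed first. A minimal sketch (the inline VALUES table is just for illustration):

SELECT CAST(json_parse(col) AS ARRAY(varchar))
FROM (VALUES '["b", "a", "b", "c"]') AS t(col);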

Storing a file with field name changes in Hadoop using Pig STORE

I load a relation using
data = load 'path' using JsonLoader('class: chararray, marks: int');
datagrouped = group data by class;
total_marks = foreach datagrouped generate group as class, SUM(data.marks) as Total_Score;
Now I get the relation
total_marks =
A, 2130
B, 1890
C, 1640
Now I store the relation using:
store total_marks into 'path' using JsonStorage();
My data gets stored as
{"class": "A", "Total_Score":2130}
{"class": "B", "Total_Score":1890}
{"class": "C", "Total_Score":1640}
This in my case is not the output I require. I want to output to be:
{"group": "A", "Total_Score":2130}
{"group": "B", "Total_Score":1890}
{"group": "C", "Total_Score":1640}
How can I achieve this?
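One likely fix, sketched from Pig's default field naming rather than taken from the thread: the stored field is called class only because the foreach aliases the grouping key with as class. Dropping that alias keeps the key's default name, group, which JsonStorage then writes out:

total_marks = foreach datagrouped generate group, SUM(data.marks) as Total_Score;
store total_marks into 'path' using JsonStorage();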

Drop duplicated rows and merge the ids using groupby in PySpark

I have a dataframe where some rows have duplicated ids but different timestamps, and some rows have duplicated ids with the same timestamp but one of the columns (yob or gender) null. Now I want to do an operation using groupby:
If the same id has different timestamps, pick the most recent timestamp.
If the same id has the same timestamp but one of the columns (yob or gender) is null, merge both rows into a single record without nulls. Below I have pasted the data frame and desired output.
Input data
from pyspark.sql.functions import col, max as max_
df = sc.parallelize([
("e5882", "null", "M", "AD", "9/14/2021 13:50"),
("e5882", "null", "M", "AD", "10/22/2021 13:10"),
("5cddf", "null", "M", "ED", "9/9/2021 12:00"),
("5cddf", "2010", "null", "ED", "9/9/2021 12:00"),
("c3882", "null", "M", "BD", "11/27/2021 5:00"),
("c3882", "1975", "null", "BD", "11/27/2021 5:00"),
("9297d","1999", "null", "GF","10/18/2021 7:00"),
("9298e","1990","null","GF","10/18/2021 7:00")
]).toDF(["ID", "yob", "gender","country","timestamp"])
Desired output: one row per id, with the most recent timestamp kept and the null yob/gender values merged away.
The code I used for this problem does not give an accurate result; some of the ids are missing:
from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy('ID')
# to obtain the most recent date
df1 = df.withColumn('maxB', F.max('timestamp').over(w)).where(F.col('timestamp') == F.col('maxB')).drop('maxB')
# to merge the null columns based on id
(df1.groupBy('ID').agg(*[F.first(x, ignorenulls=True) for x in df1.columns if x != 'ID'])).show()
Using this input dataframe:
df = spark.createDataFrame([
("e5882", None, "M", "AD", "9/14/2021 13:50"),
("e5882", None, "M", "AD", "10/22/2021 13:10"),
("5cddf", None, "M", "ED", "9/9/2021 12:00"),
("5cddf", "2010", None, "ED", "9/9/2021 12:00"),
("c3882", None, "M", "BD", "11/27/2021 5:00"),
("c3882", "1975", None, "BD", "11/27/2021 5:00"),
("9297d", None, "M", "GF", "10/18/2021 7:00"),
("9297d", "1999", None, "GF", "10/18/2021 7:00"),
("9298e", "1990", None, "GF", "10/18/2021 7:00"),
], ["id", "yob", "gender", "country", "timestamp"])
If the same id has different timestamps, pick the most recent timestamp.
Use a window ranking function to get the most recent row per id. As you want to merge rows with the same timestamp, you can use dense_rank instead of row_number. But first you need to convert the timestamp strings into TimestampType, otherwise the comparison won't be correct (as strings, '9/9/2021 12:00' > '10/18/2021 7:00'):
from pyspark.sql import Window
import pyspark.sql.functions as F
df_most_recent = df.withColumn(
    "timestamp",
    F.to_timestamp("timestamp", "M/d/yyyy H:mm")
).withColumn(
    "rn",
    F.dense_rank().over(Window.partitionBy("id").orderBy(F.desc("timestamp")))
).filter("rn = 1")
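As a quick illustration of why the to_timestamp conversion matters (plain Python string comparison, mirroring what Spark does on string columns):

print('9/9/2021 12:00' > '10/18/2021 7:00')  # True: '9' sorts after '1', although the date is earlier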
If the same id has the same timestamp but one of the columns (yob or gender) is null, merge both rows into a single record without nulls.
Now that df_most_recent contains one or more rows sharing the most recent timestamp per id, you can group by id to merge the values of the other columns like this:
result = df_most_recent.groupBy("id").agg(
    *[F.collect_set(c)[0].alias(c) for c in df.columns if c != 'id']
    # or *[F.first(c, ignorenulls=True).alias(c) for c in df.columns if c != 'id']
)
result.show()
#+-----+----+------+-------+-------------------+
#|id |yob |gender|country|timestamp |
#+-----+----+------+-------+-------------------+
#|5cddf|2010|M |ED |2021-09-09 12:00:00|
#|9297d|1999|M |GF |2021-10-18 07:00:00|
#|9298e|1990|null |GF |2021-10-18 07:00:00|
#|c3882|1975|M |BD |2021-11-27 05:00:00|
#|e5882|null|M |AD |2021-10-22 13:10:00|
#+-----+----+------+-------+-------------------+

Merge dataframe rows if fields are in the dictionary

I want to merge rows whose values belong to the same dictionary entry.
I have two pandas dataframes:
Name Values
"ABC" ["A", "B", "C"]
"DEF" ["D", "E", "F"]
and
Value First Second
"A" 0 5
"B" 2 1
"C" 3 5
"Z" 3 0
I would like to get:
Name First Second
"ABC" 5 11
"Z" 3 0
Is there an easy way to do that? I didn't find anything good.
Try explode, then map each value back to its group name, falling back to the value itself for entries (like "Z") that belong to no group:
s = df['Value'].map(df2.explode('Values').set_index('Values')['Name']).fillna(df['Value'])
out = df.groupby(s.rename('Name'))[['First', 'Second']].sum().reset_index()
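For reference, a minimal self-contained version; the names df2 for the Name/Values frame and df for the Value/First/Second frame are assumptions matching the answer:

import pandas as pd

df2 = pd.DataFrame({"Name": ["ABC", "DEF"],
                    "Values": [["A", "B", "C"], ["D", "E", "F"]]})
df = pd.DataFrame({"Value": ["A", "B", "C", "Z"],
                   "First": [0, 2, 3, 3],
                   "Second": [5, 1, 5, 0]})

# map each Value to its group Name; ungrouped values (Z) keep their own name
s = df['Value'].map(df2.explode('Values').set_index('Values')['Name']).fillna(df['Value'])
out = df.groupby(s.rename('Name'))[['First', 'Second']].sum().reset_index()
print(out)
#   Name  First  Second
# 0  ABC      5      11
# 1    Z      3       0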