Let's say I have this dataframe:
import pandas as pd
import pingouin as pg

data = {"col1": ["yes", "no", "yes", "no", "yes", "no", "yes", "no"],
        "col2": ["A", "A", "B", "B", "C", "C", "D", "D"],
        "col3": [24, 20, 19, 17, 24, 27, 22, 18]}
df = pd.DataFrame(data=data)
Now, if I want to run a t-test on the col3 values between col1 "yes" and "no", I can use the following call:
pg.ttest(*df.groupby("col1")["col3"].apply(lambda x:x.values))
But let's say I want to compare based on a groupby of both col1 and col2, e.g. comparing "A" "yes" vs "A" "no", "B" "yes" vs "B" "no", etc. I know I can group by two columns, but every attempt I made to feed the two-column groupby into the t-test has failed.
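One possible approach, sketched below under some assumptions: iterate over the col2 groups and, within each group, split col3 by col1 before running the test. The data here is hypothetical, because the frame in the question has only one observation per (col1, col2) cell and a t-test needs several. The helper name yes_vs_no is made up for illustration, and scipy.stats.ttest_ind (Welch variant) is swapped in for pg.ttest; the same yes/no arrays could presumably be passed to pg.ttest instead.

```python
import pandas as pd
from scipy import stats

# Hypothetical data: every (col1, col2) cell has three observations.
df = pd.DataFrame({
    "col1": ["yes", "no"] * 6,
    "col2": ["A"] * 6 + ["B"] * 6,
    "col3": [24, 20, 25, 19, 23, 21, 19, 17, 18, 16, 20, 15],
})

def yes_vs_no(group):
    """Welch t-test of col3 for col1 == 'yes' vs col1 == 'no' in one col2 group."""
    yes = group.loc[group["col1"] == "yes", "col3"]
    no = group.loc[group["col1"] == "no", "col3"]
    t, p = stats.ttest_ind(yes, no, equal_var=False)
    return pd.Series({"t": t, "p": p, "n_yes": len(yes), "n_no": len(no)})

# One row of test results per col2 value.
results = pd.DataFrame({name: yes_vs_no(g) for name, g in df.groupby("col2")}).T
print(results)
```

Looping over the groups keeps each comparison explicit and makes it easy to skip groups that have too few observations.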
Is there a good way to cast a JSON array such as '["b", "a", "b", "c"]' into an array?
For example, something that looks like this:
select array['b', 'a', 'c', 'b'];
I found it in the documentation:
select CAST(JSON '["b", "a", "b", "c"]' AS ARRAY(varchar));
I load a relation and aggregate it using:
data = load 'path' using JsonLoader('class: chararray, marks: int');
datagrouped = group data by class;
total_marks = foreach datagrouped generate group as class, SUM(data.marks) as Total_Score;
Now I get the relation total_marks:
A, 2130
B, 1890
C, 1640
Now I store the relation using:
store total_marks into 'path' using JsonStorage();
My data gets stored as
{"class": "A", "Total_Score":2130}
{"class": "B", "Total_Score":1890}
{"class": "C", "Total_Score":1640}
This is not the output I need. I want the output to be:
{"group": "A", "Total_Score":2130}
{"group": "B", "Total_Score":1890}
{"group": "C", "Total_Score":1640}
How can I achieve this?
I have a dataframe where some rows have duplicated ids with different timestamps, and other rows have duplicated ids with the same timestamp but with one of the yob and gender columns null. I want to do the following using groupby:
If the same id has different timestamps, pick the row with the most recent timestamp.
If the same id has the same timestamp but one of the columns (yob or gender) is null, merge the rows into a single record without nulls. The input dataframe and the desired output are pasted below.
Input data
from pyspark.sql import Window
import pyspark.sql.functions as F
df = sc.parallelize([
("e5882", "null", "M", "AD", "9/14/2021 13:50"),
("e5882", "null", "M", "AD", "10/22/2021 13:10"),
("5cddf", "null", "M", "ED", "9/9/2021 12:00"),
("5cddf", "2010", "null", "ED", "9/9/2021 12:00"),
("c3882", "null", "M", "BD", "11/27/2021 5:00"),
("c3882", "1975", "null", "BD", "11/27/2021 5:00"),
("9297d","1999", "null", "GF","10/18/2021 7:00"),
("9298e","1990","null","GF","10/18/2021 7:00")
]).toDF(["ID", "yob", "gender","country","timestamp"])
Desired output:
This is the code I used, but it does not give the correct result; some ids are missing:
w = Window.partitionBy('Id')
# to obtain the recent date
df1 = df.withColumn('maxB', F.max('timestamp').over(w)).where(F.col('timestamp') == F.col('maxB')).drop('maxB')
# to merge the null column based of id
(df1.groupBy('Id').agg(*[F.first(x,ignorenulls=True) for x in df1.columns if x!='Id'])).show()
Using this input dataframe:
df = spark.createDataFrame([
("e5882", None, "M", "AD", "9/14/2021 13:50"),
("e5882", None, "M", "AD", "10/22/2021 13:10"),
("5cddf", None, "M", "ED", "9/9/2021 12:00"),
("5cddf", "2010", None, "ED", "9/9/2021 12:00"),
("c3882", None, "M", "BD", "11/27/2021 5:00"),
("c3882", "1975", None, "BD", "11/27/2021 5:00"),
("9297d", None, "M", "GF", "10/18/2021 7:00"),
("9297d", "1999", None, "GF", "10/18/2021 7:00"),
("9298e", "1990", None, "GF", "10/18/2021 7:00"),
], ["id", "yob", "gender", "country", "timestamp"])
If the same id has different timestamps, pick the row with the most recent timestamp.
Use a window ranking function to get the most recent row per id. Since you want to merge rows that share the same timestamp, you can use dense_rank instead of row_number. First, though, you need to convert the timestamp strings into TimestampType, otherwise the comparison won't be correct (as strings, '9/9/2021 12:00' > '10/18/2021 7:00').
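from pyspark.sql import Window
import pyspark.sql.functions as F

# Parse the strings first, then rank the rows of each id, newest timestamp first.
w = Window.partitionBy("id").orderBy(F.desc("timestamp"))
df_most_recent = df.withColumn(
    "timestamp",
    F.to_timestamp("timestamp", "M/d/yyyy H:mm")
).withColumn(
    "rn", F.dense_rank().over(w)
).filter("rn = 1")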
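The string-ordering pitfall can be checked without Spark. The following is a small plain-Python sketch using two of the timestamps from the input data; the strptime pattern "%m/%d/%Y %H:%M" is my assumption for the equivalent of the "M/d/yyyy H:mm" format.

```python
from datetime import datetime

raw = ["9/9/2021 12:00", "10/18/2021 7:00"]

# Compared as text, "9..." sorts after "1...", so the older row wins.
latest_as_text = max(raw)

# Compared as parsed datetimes, the October row is correctly the latest.
latest_as_time = max(raw, key=lambda s: datetime.strptime(s, "%m/%d/%Y %H:%M"))

print(latest_as_text)  # 9/9/2021 12:00
print(latest_as_time)  # 10/18/2021 7:00
```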
If the same id has the same timestamp but one of the columns (yob or gender) is null, merge the rows into a single record without nulls.
The df_most_recent dataframe above now contains one or more rows per id, all sharing that id's most recent timestamp. You can group by id to merge the values of the other columns like this:
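result = df_most_recent.groupBy("id").agg(
    # collect_set drops nulls, so element 0 is a non-null value when one exists
    *[F.collect_set(c)[0].alias(c) for c in df.columns if c != 'id']
    # or *[F.first(c, ignorenulls=True).alias(c) for c in df.columns if c != 'id']
)
result.show(truncate=False)

#+-----+----+------+-------+-------------------+
#|id   |yob |gender|country|timestamp          |
#+-----+----+------+-------+-------------------+
#|5cddf|2010|M     |ED     |2021-09-09 12:00:00|
#|9297d|1999|M     |GF     |2021-10-18 07:00:00|
#|9298e|1990|null  |GF     |2021-10-18 07:00:00|
#|c3882|1975|M     |BD     |2021-11-27 05:00:00|
#|e5882|null|M     |AD     |2021-10-22 13:10:00|
#+-----+----+------+-------+-------------------+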
I want to merge fields if values are in the dictionary.
I have two pandas dataframes:
Name Values
"ABC" ["A", "B", "C"]
"DEF" ["D", "E", "F"]
Value First Second
"A" 0 5
"B" 2 1
"C" 3 5
"Z" 3 0
I would like to get:
Name First Second
"ABC" 5 11
"Z" 3 0
Is there an easy way to do that? I couldn't find a good solution.
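Try explode, then map each value to its name and group on the result. Here df is assumed to be the frame with the Value, First and Second columns, and df2 the frame with the Name and Values columns: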
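# Map each Value to its Name; values with no match (e.g. "Z") keep their own label
s = df['Value'].map(df2.explode('Values').set_index('Values')['Name']).fillna(df['Value'])
out = df.groupby(s).agg({'First': 'sum', 'Second': 'sum'}).rename_axis('Name').reset_index()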
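For reference, here is a self-contained sketch of the explode-then-map idea, with the two frames rebuilt from the tables in the question. The variable names lookup and key are mine, and falling back to the original Value for unmatched rows (so that "Z" survives) is an assumption based on the desired output.

```python
import pandas as pd

# Frames rebuilt from the tables in the question.
df2 = pd.DataFrame({"Name": ["ABC", "DEF"],
                    "Values": [["A", "B", "C"], ["D", "E", "F"]]})
df = pd.DataFrame({"Value": ["A", "B", "C", "Z"],
                   "First": [0, 2, 3, 3],
                   "Second": [5, 1, 5, 0]})

# One row per list element, indexed by that element: "A" -> "ABC", ...
lookup = df2.explode("Values").set_index("Values")["Name"]

# Values without a match (here "Z") keep their own label as the group key.
key = df["Value"].map(lookup).fillna(df["Value"])

out = (df.groupby(key)[["First", "Second"]].sum()
         .rename_axis("Name")
         .reset_index())
print(out)
```

On this data the result has two rows: "ABC" with First 5 and Second 11, and "Z" with First 3 and Second 0.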