t-test several subgroups using groupby - pandas

Let's say I have this dataframe:
import pandas as pd
import pingouin as pg

data = {"col1": ["yes", "no", "yes", "no", "yes", "no", "yes", "no"],
        "col2": ["A", "A", "B", "B", "C", "C", "D", "D"],
        "col3": [24, 20, 19, 17, 24, 27, 22, 18]}
df = pd.DataFrame(data=data)
Now, if I want to t-test the col3 values between col1 "yes" and "no", I can use the following:
pg.ttest(*df.groupby("col1")["col3"].apply(lambda x: x.values))
But let's say I want to compare based on grouping by both col1 and col2, e.g. comparing "A" "yes" vs "A" "no", "B" "yes" vs "B" "no", etc. I know I can group by two columns, but every attempt I made to feed the two-column groupby into the t-test has failed.
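One way to approach this (a minimal sketch, not a definitive answer: it assumes each (col1, col2) cell holds several observations in the real data, since the toy frame above has only one value per cell and a t-test needs more) is to loop over the col2 groups and reuse the same one-column pattern inside each group:
results = {}
for name, sub in df.groupby("col2"):
    # split this col2 subgroup into its "yes" and "no" col3 arrays and t-test them
    results[name] = pg.ttest(*sub.groupby("col1")["col3"].apply(lambda x: x.values))

# results["A"], results["B"], ... each hold a pingouin result table;
# pd.concat(results) stacks them into a single dataframe keyed by col2 level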


how to sort the values in the dataframes

My dataframe:
How can I sort the values to match the final sorted result? I don't know how to do it with pandas.
What you are really looking to do is concatenate columns A and B to get your number. An easy way to do this is to convert both to strings, add them together, and then convert the result back to an integer.
# creating the dataframe
import pandas as pd

data = dict(A=[1, 2, 3, 4, 5, 6], B=[6, 5, 4, 2, 1, 3], values=["a", "b", "c", "d", "e", "f"])
so_data = pd.DataFrame(data)
# concatenate A and B as strings, then convert back to an integer
so_data["final values"] = (so_data["A"].astype(str) + so_data["B"].astype(str)).astype(int)
I am now realizing that your "values" column is not sorted in the usual way either. I am not sure how that one is ordered; it seems like there is some missing information to me.
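With the combined key in place, the sort itself is just a sort_values call on the new column (a small follow-up sketch on the so_data frame built above):
# order the rows by the concatenated A+B key
so_data_sorted = so_data.sort_values("final values").reset_index(drop=True)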

How to flatten a JSON in Snowflake SQL?

I have a table "table_1" with one column called "Value", and it only has one entry. The entry in that column is a JSON that looks like:
{
"c1": "A",
"c10": "B",
"c100": "C",
"c101": "D",
"c102": "E",
"c103": "F",
"c104": "G",
.......
}
I would like to separate this JSON into two columns, where one column contains the keys (c1, c10, etc.) and the second column contains the associated value for each key (A, B, etc.). Is there a way I can do this? There are about 125 keys in my JSON.
It is possible to achieve this using the FLATTEN function:
CREATE OR REPLACE TABLE tab
AS
SELECT PARSE_JSON('{
    "c1": "A",
    "c10": "B",
    "c100": "C",
    "c101": "D",
    "c102": "E",
    "c103": "F",
    "c104": "G"
}') AS col;

SELECT KEY, VALUE::TEXT AS value
FROM tab,
     TABLE(FLATTEN(INPUT => tab.COL));
Output:
KEY   VALUE
c1    A
c10   B
c100  C
c101  D
c102  E
c103  F
c104  G

Drop the duplicated rows and merge the ids using groupby in pyspark

I have a dataframe where some rows have duplicated ids but different timestamps, and some rows have duplicated ids with the same timestamp but with one of the following columns (yob and gender) null. Now I want to do an operation using groupby:
If the same id has different timestamps, I want to pick the most recent timestamp.
If the same id has the same timestamp but any of the columns (yob and gender) is null, I want to merge both rows into a single record without nulls. Below I have pasted the dataframe and the desired output.
Input data
from pyspark.sql.functions import col, max as max_

df = sc.parallelize([
    ("e5882", "null", "M", "AD", "9/14/2021 13:50"),
    ("e5882", "null", "M", "AD", "10/22/2021 13:10"),
    ("5cddf", "null", "M", "ED", "9/9/2021 12:00"),
    ("5cddf", "2010", "null", "ED", "9/9/2021 12:00"),
    ("c3882", "null", "M", "BD", "11/27/2021 5:00"),
    ("c3882", "1975", "null", "BD", "11/27/2021 5:00"),
    ("9297d", "1999", "null", "GF", "10/18/2021 7:00"),
    ("9298e", "1990", "null", "GF", "10/18/2021 7:00")
]).toDF(["ID", "yob", "gender", "country", "timestamp"])
Desired output:
The code I used for this problem, which does not give the accurate result (some of the ids are missing):
from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy('Id')
# to obtain the most recent date
df1 = df.withColumn('maxB', F.max('timestamp').over(w)).where(F.col('timestamp') == F.col('maxB')).drop('maxB')
# to merge the null columns based on id
(df1.groupBy('Id').agg(*[F.first(x, ignorenulls=True) for x in df1.columns if x != 'Id'])).show()
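One likely reason this misses the expected rows, as the answer below also points out, is that timestamp is still a string here, so F.max compares it lexicographically rather than chronologically. A quick illustration of the same ordering in plain Python:
# lexicographic string comparison: '9...' sorts after '1...', so September "beats" October
print('9/14/2021 13:50' > '10/22/2021 13:10')  # True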
Using this input dataframe:
df = spark.createDataFrame([
    ("e5882", None, "M", "AD", "9/14/2021 13:50"),
    ("e5882", None, "M", "AD", "10/22/2021 13:10"),
    ("5cddf", None, "M", "ED", "9/9/2021 12:00"),
    ("5cddf", "2010", None, "ED", "9/9/2021 12:00"),
    ("c3882", None, "M", "BD", "11/27/2021 5:00"),
    ("c3882", "1975", None, "BD", "11/27/2021 5:00"),
    ("9297d", None, "M", "GF", "10/18/2021 7:00"),
    ("9297d", "1999", None, "GF", "10/18/2021 7:00"),
    ("9298e", "1990", None, "GF", "10/18/2021 7:00"),
], ["id", "yob", "gender", "country", "timestamp"])
If the same id has different timestamps, I want to pick the most recent timestamp.
Use a window ranking function to get the most recent row per id. Since you want to merge rows that share the same timestamp, you can use dense_rank instead of row_number. But first you need to convert the timestamp strings into TimestampType, otherwise the comparison won't be correct (as the string '9/9/2021 12:00' > '10/18/2021 7:00'):
from pyspark.sql import Window
import pyspark.sql.functions as F

df_most_recent = df.withColumn(
    "timestamp",
    F.to_timestamp("timestamp", "M/d/yyyy H:mm")
).withColumn(
    "rn",
    F.dense_rank().over(Window.partitionBy("id").orderBy(F.desc("timestamp")))
).filter("rn = 1")
If the same id has the same timestamp but any of the columns (yob and gender) is null, I want to merge both rows into a single record without nulls. Below I have pasted the dataframe and the desired output.
Now that df_most_recent contains, for each id, one or more rows that all share that id's most recent timestamp, you can group by id to merge the values of the other columns like this:
result = df_most_recent.groupBy("id").agg(
    *[F.collect_set(c)[0].alias(c) for c in df.columns if c != 'id']
    # or *[F.first(c).alias(c) for c in df.columns if c != 'id']
)
result.show()
#+-----+----+------+-------+-------------------+
#|id |yob |gender|country|timestamp |
#+-----+----+------+-------+-------------------+
#|5cddf|2010|M |ED |2021-09-09 12:00:00|
#|9297d|1999|M |GF |2021-10-18 07:00:00|
#|9298e|1990|null |GF |2021-10-18 07:00:00|
#|c3882|1975|M |BD |2021-11-27 05:00:00|
#|e5882|null|M |AD |2021-10-22 13:10:00|
#+-----+----+------+-------+-------------------+
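A side note on the commented-out F.first alternative above: by default first does not skip nulls, so if you go that route, passing ignorenulls=True preserves the null-merging behaviour (a sketch equivalent in spirit to the collect_set version):
result = df_most_recent.groupBy("id").agg(
    *[F.first(c, ignorenulls=True).alias(c) for c in df.columns if c != 'id']
)
result.show()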

Exclude low sample counts from Pandas' "groupby" calculations

Using Pandas, I'd like to "groupby" and calculate the mean values for each group of my Dataframe. I do it like this:
import pandas as pd

data = {
    "group": ["A", "B", "C", "A", "A", "B", "B", "C", "A"],
    "value": [5, 6, 8, 7, 3, 9, 4, 6, 5],
}
df = pd.DataFrame(data)
print(df)

g = df.groupby([df['group']]).mean()
print(g)
Which gives me:
          value
group
A      5.000000
B      6.333333
C      7.000000
However, I'd like to exclude groups which have, let's say, fewer than 3 entries (so that the mean is actually meaningful). In this case, that would exclude group "C" from the results. How can I implement this?
Filter the groups based on their length, then take the mean.
# overall mean of the values that remain after dropping groups with fewer than 3 entries
filtered_mean = df.groupby('group').filter(lambda x: len(x) >= 3)['value'].mean()
# if you want the mean group-wise after filtering out the small groups
result = df.groupby('group').filter(lambda x: len(x) >= 3).groupby('group').mean().reset_index()
Output:
  group     value
0     A  5.000000
1     B  6.333333
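An alternative sketch that builds a boolean mask from the per-group size with transform, which some find easier to read than filter (the threshold of 3 matches the question):
# keep only rows whose group has at least 3 entries, then average per group
mask = df.groupby('group')['value'].transform('size') >= 3
result = df[mask].groupby('group')['value'].mean().reset_index()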

Merge dataframe rows if fields are in the dictionary

I want to merge fields if values are in the dictionary.
I have two pandas dataframes:
Name   Values
"ABC"  ["A", "B", "C"]
"DEF"  ["D", "E", "F"]
and
Value  First  Second
"A"    0      5
"B"    2      1
"C"    3      5
"Z"    3      0
I would like to get:
Name   First  Second
"ABC"  5      11
"Z"    3      0
Is there an easy way to do that? I didn't find anything that works.
Try with explode, then map the keys:
# df2 is the Name/Values "dictionary" frame; df is the Value/First/Second frame
s = df['Value'].map(df2.explode('Values').set_index('Values')['Name']).fillna(df['Value'])
out = df.groupby(s.rename('Name')).agg({'First': 'sum', 'Second': 'sum'}).reset_index()
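For reference, a minimal self-contained sketch that rebuilds both frames from the question (keeping the snippet's naming, with df2 as the Name/Values dictionary and df as the value table) and shows the merged result:
import pandas as pd

df2 = pd.DataFrame({"Name": ["ABC", "DEF"], "Values": [["A", "B", "C"], ["D", "E", "F"]]})
df = pd.DataFrame({"Value": ["A", "B", "C", "Z"], "First": [0, 2, 3, 3], "Second": [5, 1, 5, 0]})

# map each Value to its Name via the exploded dictionary; unmatched values keep themselves
s = df['Value'].map(df2.explode('Values').set_index('Values')['Name']).fillna(df['Value'])
out = df.groupby(s.rename('Name')).agg({'First': 'sum', 'Second': 'sum'}).reset_index()
print(out)
#   Name  First  Second
# 0  ABC      5      11
# 1    Z      3       0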