Add a categorical column with three values assigned to each row in a PySpark dataframe, then apply aggregate functions to 30 columns

I have a dataframe as a result of validation codes:
df = spark.createDataFrame(
    list(zip(
        ['c_1', 'c_1', 'c_1', 'c_2', 'c_3', 'c_1', 'c_2', 'c_2'],
        ['valid', 'valid', 'invalid', 'missing', 'invalid', 'valid', 'valid', 'missing'],
        ['missing', 'valid', 'invalid', 'invalid', 'valid', 'valid', 'missing', 'missing'],
        ['invalid', 'valid', 'valid', 'missing', 'missing', 'valid', 'invalid', 'missing'],
    ))
).toDF('clinic_id', 'name', 'phone', 'city')
I counted the number of valid, invalid, and missing values, grouped by clinic_id, using aggregate code in PySpark:
from pyspark.sql.functions import col, when, sum

agg_table = (
    df
    .groupBy('clinic_id')
    .agg(
        # name
        sum(when(col('name') == 'valid', 1).otherwise(0)).alias('validname'),
        sum(when(col('name') == 'invalid', 1).otherwise(0)).alias('invalidname'),
        sum(when(col('name') == 'missing', 1).otherwise(0)).alias('missingname'),
        # phone
        sum(when(col('phone') == 'valid', 1).otherwise(0)).alias('validphone'),
        sum(when(col('phone') == 'invalid', 1).otherwise(0)).alias('invalidphone'),
        sum(when(col('phone') == 'missing', 1).otherwise(0)).alias('missingphone'),
        # city
        sum(when(col('city') == 'valid', 1).otherwise(0)).alias('validcity'),
        sum(when(col('city') == 'invalid', 1).otherwise(0)).alias('invalidcity'),
        sum(when(col('city') == 'missing', 1).otherwise(0)).alias('missingcity'),
    )
)
display(agg_table)
output:
clinic_id validname invalidname missingname ... invalidcity missingcity
--------- --------- ----------- ----------- ... ----------- -----------
c_1 3 1 0 ... 1 0
c_2 1 0 2 ... 1 0
c_3 0 1 0 ... 0 1
The resulting aggregated table is fine, but it is not ideal for further analysis. I tried pivoting within PySpark to get something like the table below:
#note: counts below are just made up, not the actual count from above, but I hope you get what I mean.
clinic_id category name phone city
-------- ------- ---- ------- ----
c_1 valid 3 1 3
c_1 invalid 1 0 2
c_1 missing 0 2 3
c_2 valid 3 1 3
c_2 invalid 1 0 2
c_2 missing 0 2 3
c_3 valid 3 1 3
c_3 invalid 1 0 2
c_3 missing 0 2 3
I initially searched for pivot/unpivot, but I learned it is also referred to as unstack, and I also came across mapping.
I tried the suggested approach in "How to unstack dataset (using pivot)?", but it shows me only one column, and I cannot get the desired result when I apply it to my dataframe of 30 columns.
I also tried the following using the validated table/dataframe:

expression = ""
cnt = 0
for column in agg_table.columns:
    if column != 'clinc_id':
        cnt += 1
        expression += f"'{column}', {column},"

exprs = f"stack({cnt}, {expression[:-1]}) as (Type, Value)"
unpivoted = agg_table.select('clinic_id', expr(exprs))
I get an error that just points to that line, which may be referring to a return value.
I also tried grouping the results by id and category, but that is where I am stuck. If I group by an aggregated variable, say the values of validname, the aggregate function only counts the values in that column and does not apply to every count column. So I thought of inserting a column with .withColumn that assigns the three categories to each ID, so that each aggregated count is grouped by id and category as in the table above, but I have had no luck finding a solution for this.
Also, maybe a SQL approach would be easier?
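For reference, a corrected sketch of the stack() attempt above is shown below. One possible cause of the error is the 'clinc_id' typo, which would pull the string clinic_id column into the stack alongside the bigint counts (stack needs compatible types) - this is a guess, not a confirmed diagnosis.

from pyspark.sql.functions import expr

count_cols = [c for c in agg_table.columns if c != 'clinic_id']  # note the spelling
pairs = ", ".join(f"'{c}', {c}" for c in count_cols)
unpivoted = agg_table.select(
    'clinic_id',
    expr(f"stack({len(count_cols)}, {pairs}) as (Type, Value)")
)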

I found the right search phrase: "column to row in pyspark"
One of the suggested answers that fits my dataframe is this function:
from pyspark.sql.functions import array, col, explode, lit, struct

def to_long(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])

long_df = to_long(df, ["clinic_id"])
This created a dataframe of three columns: clinic_id, key (the original column name), and val (the status: valid, invalid, or missing), which I then renamed to column_names and status.
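The rename itself is not shown above; one small assumed step to get the column names used below:

long_df = (long_df
           .withColumnRenamed('key', 'column_names')
           .withColumnRenamed('val', 'status'))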
Then I created my aggregated table grouped by clinic_id, status:
display(
    long_df.groupBy('clinic_id', 'status')
    .agg(
        sum(when(col('column_names') == 'name', 1).otherwise(0)).alias('name'),
        sum(when(col('column_names') == 'phone', 1).otherwise(0)).alias('phone'),
        sum(when(col('column_names') == 'city', 1).otherwise(0)).alias('city')
    )
)
I got my intended table.
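As a side note, a similar table can likely be built in one pass over the raw dataframe by unpivoting the three status columns with stack() and then pivoting on the column name; this is a sketch, not the approach used above:

from pyspark.sql.functions import expr

# Unpivot the three status columns into (column_name, status) pairs...
unpivoted = df.select(
    'clinic_id',
    expr("stack(3, 'name', name, 'phone', phone, 'city', city) as (column_name, status)")
)
# ...then count rows per (clinic_id, status) and spread column_name back into columns.
result = (unpivoted
          .groupBy('clinic_id', 'status')
          .pivot('column_name', ['name', 'phone', 'city'])
          .count()
          .fillna(0))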

Related

Dask: how to assign the value from one row to another while working with a huge CSV file - ValueError: Arrays chunk sizes are unknown: (nan,)

I have a big CSV file (290 GB), which I read using dask. It contains information on the birth years of many individuals (and their parents). I need to create a new column, EVENT_DATE, that will contain the birth year for individuals (type I) and the birth year of their children for parents (type P).
I know it is a big file and will take some time to process, but I have the impression I am not using dask in the correct way.
The original data looks like this (with many more columns):
id   sourcename   type   event   birth_year   father_id   mother_id
1    source_A     I      B       1789         2           3
2    source_A     P      B
3    source_A     P      B
..   ...          ...    ...     ...          ...         ...
n    source_B     I      B       1800         x           y
And what I'd like to obtain is something like this:
id   sourcename   type   event   birth_year   father_id   mother_id   EVENT_DATE
1    source_A     I      B       1789         2           3           1789
2    source_A     P      B                                            1789
3    source_A     P      B                                            1789
..   ...          ...    ...     ...          ...         ...         ...
n    source_B     I      B       1800         x           y           1800
- I filter the ddf using a list of unique values, sourcenames, for the column 'sourcename', and then iterate this operation to work on smaller dataframes.
- I repartition and then perform operations on these slices.
- I want to save them as separate parquet files.

My code so far looks like this:
import dask.dataframe as dd
import pandas as pd

ddf = dd.read_csv('raw_data.csv')
# I create the EVENT_DATE column:
ddf['EVENT_DATE'] = pd.NA
# I then create a smaller df based on sourcename
for source in sourcenames:
    df = ddf[ddf.sourcename == source]
    df = df.repartition(partition_size="100MB")
    # I then add the info on the birth year in a new EVENT_DATE column
    df['EVENT_DATE'] = df.apply(
        lambda x: x['birth_year'] if x['type'] == 'I' and x['event'] == 'B' else pd.NA,
        axis=1, meta=(None, 'object'))
    # I then try to match parents' EVENT_DATE to the birth year of their children.
    # Restricting to 10 rows above and below might speed up calculations, since
    # related observations are unlikely to be far apart in the dataset:
    df['EVENT_DATE'] = df.apply(
        lambda x: df[(df.index > x.name - 10) & (df.index < x.name + 10)
                     & ((df['father_id'] == x['id']) | (df['mother_id'] == x['id']))
                     ]['EVENT_DATE'].values[0]
        if x['type'] == 'P' else x['EVENT_DATE'],
        axis=1, meta=(None, 'object'))
This gives me the following error
ValueError: Arrays chunk sizes are unknown: (nan,)
A possible solution: https://docs.dask.org/en/latest/array-chunks.html#unknown-chunks
Summary: to compute chunk sizes, use
x.compute_chunk_sizes()          # for a Dask Array x
ddf.to_dask_array(lengths=True)  # for a Dask DataFrame ddf
I think I may understand what the issue is, but I certainly don't understand how to solve it.
Any help would be immensely appreciated.
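Not part of the original thread, but one way to avoid the row-wise apply entirely is to build a small (parent id, child birth year) lookup table and merge it back in. A rough sketch using the column names shown above; the lookup construction and the assumption of one relevant child per parent are mine:

import dask.dataframe as dd

ddf = dd.read_csv('raw_data.csv', dtype={'id': 'object', 'father_id': 'object', 'mother_id': 'object'})

# Individuals keep their own birth year; parents are filled in below.
ddf['EVENT_DATE'] = ddf['birth_year'].where((ddf['type'] == 'I') & (ddf['event'] == 'B'))

# One row per parent id carrying the child's birth year (first child found wins).
children = ddf[ddf['type'] == 'I'][['birth_year', 'father_id', 'mother_id']]
fathers = children[['father_id', 'birth_year']].rename(columns={'father_id': 'id'})
mothers = children[['mother_id', 'birth_year']].rename(columns={'mother_id': 'id'})
parent_dates = (dd.concat([fathers, mothers])
                .dropna(subset=['id'])
                .drop_duplicates(subset=['id'])
                .rename(columns={'birth_year': 'child_birth_year'}))

# Attach the child's birth year and use it where EVENT_DATE is still empty.
ddf = ddf.merge(parent_dates, on='id', how='left')
ddf['EVENT_DATE'] = ddf['EVENT_DATE'].where(ddf['EVENT_DATE'].notnull(), ddf['child_birth_year'])
ddf = ddf.drop(columns='child_birth_year')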

Pandas DataFrame read_csv then GroupBy - How to get just a single count instead of one per column

I'm getting the counts I want, but I don't understand why it creates a separate count for each data column. How can I create just one column called "count"? Would the counts only differ when a column has a null (NaN) value?
Also, what are the actual column names below? Is the column name a tuple?
Can I change the groupby/agg to return just one column called "Count"?
CSV Data:
'Occupation','col1','col2'
'Carpenter','data1','x'
'Carpenter','data2','y'
'Carpenter','data3','z'
'Painter','data1','x'
'Painter','data2','y'
'Programmer','data1','z'
'Programmer','data2','x'
'Programmer','data3','y'
'Programmer','data4','z'
Program:
import pandas as pd

filename = "./data/TestGroup.csv"
df = pd.read_csv(filename)
print(df.head())
print("Computing stats by HandRank... ")
df_stats = df.groupby("'Occupation'").agg(['count'])
print(df_stats.head())
print("----- Columns-----")
for col_name in df_stats.columns:
    print(col_name)
Output:
Computing stats by HandRank...
'col1' 'col2'
count count
'Occupation'
'Carpenter' 3 3
'Painter' 2 2
'Programmer' 4 4
----- Columns-----
("'col1'", 'count')
("'col2'", 'count')
The df.head() shows it is using "Occupation" as my column name.
Try with size: count is computed per column and skips NaN, which is why you get one count per column (and yes, those column names are tuples, coming from the MultiIndex that agg(['count']) creates). size() counts rows once per group:
df_stats = df.groupby("'Occupation'").size().to_frame('count')
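If you'd rather have the group key as a regular column instead of the index, a reset_index variant should also work (my addition, not part of the original answer):

df_stats = df.groupby("'Occupation'").size().reset_index(name='count')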

matching consecutive pairs in pd.Series

I have a DataFrame which looks like this:
ID | act
1 A
1 B
1 C
1 D
2 A
2 B
3 A
3 C
I am trying to get the IDs where an activity act1 is immediately followed by another activity act2; for example, A is followed by B. In that case, I want to get [1, 2] as the IDs. How do I go about this in a vectorized manner?
Edit: Expected output: for the sample df defined above, the output should be a list/Series of all the IDs where A is immediately followed by B:
IDs
1
2
Here is a simple, vectorised way to do it!
df.loc[(df.act == 'A') & (df.act.shift(-1) == 'B') & (df.ID == df.ID.shift(-1)), 'ID']
Output:
0 1
4 2
Name: ID, dtype: int64
Another way of writing this, possibly clearer:
conditions = (df.act == 'A') & (df.act.shift(-1) == 'B') & (df.ID == df.ID.shift(-1))
df.loc[conditions, 'ID']
NumPy (which backs pandas) makes it easy to combine one or many boolean conditions; the resulting boolean vector is then used to filter your dataframe.
Here is one approach: groupby without sorting, since we need to track B immediately following A based on the current dataframe order. Then:
- aggregate each group's activities into one string with str.cat
- check whether 'A,B' is present
- get the index
- pass it as a list
(df
 .groupby('ID', sort=False)
 .act
 .agg(lambda x: x.str.cat(sep=','))
 .str.contains('A,B')
 .loc[lambda x: x == 1]
 .index.tolist()
)
[1, 2]
Another approach is using the shift function and filtering:
df['x'] = df.act.shift()
df.loc[lambda x: (x['act'] == 'B') & (x['x'] == 'A')].ID
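The shift-based version above does not check that the A and the B belong to the same ID; if that matters, the same guard used in the first answer can be added (my addition, not in the original answer):

df['x'] = df.act.shift()
df['prev_id'] = df.ID.shift()
df.loc[lambda x: (x['act'] == 'B') & (x['x'] == 'A') & (x['ID'] == x['prev_id'])].ID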

pandas syntax examples confusion

I am confused by some of the examples I see for pandas. For example this is shortened from a post I recently read:
df[df.duplicated()|df()]
What I don't understand is why df needs to be on the outside, as in df[df.duplicated()],
versus just using df.duplicated(). In the documentation I have not yet seen the first form; everything is presented in the format df.something_doing(). But I see many examples such as df[df.something_doing()], and I don't understand what the df on the outside does.
df.duplicated() returns boolean values: a mask that is True where the condition is satisfied and False otherwise.
If you want a slice of the dataframe based on the boolean mask, you need:
df[df.duplicated()]
Another simple example, consider this dataframe
col1 id
0 1 a
1 0 a
2 1 a
3 1 b
If you only want the rows where 'id' is 'a',
df.id == 'a'
would give you boolean mask but
df[df.id == 'a']
would return the dataframe
col1 id
0 1 a
1 0 a
2 1 a
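Applying the same idea to duplicated() on a small frame (an illustrative sketch, not from the original answer):

import pandas as pd

df = pd.DataFrame({'col1': [1, 0, 1, 1], 'id': ['a', 'a', 'a', 'b']})

mask = df.duplicated()   # boolean Series: True where a row repeats an earlier row
print(mask)              # only row 2 repeats row 0 here
print(df[mask])          # the slice of df containing just the duplicated rows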

Pandas dataframe row removal

I am trying to repair a CSV file.
Some data rows need to be removed based on a couple of conditions.
Say you have the following dataframe:
  A    B    C
000    0    0
000    1    0
001    0    1
011    1    0
001    1    1
If two or more rows have column A in common, I want to keep the row that has column B set to 1.
The resulting dataframe should look like this:
  A    B    C
000    1    0
011    1    0
001    1    1
I've experimented with merges and drop_duplicates but cannot seem to get the result I need. It is not certain that the row with column B = 1 will come after a row with B = 0. The take_last argument of drop_duplicates seemed attractive, but I don't think it applies here.
Any advice will be greatly appreciated. Thank you.
Not straightforward, but this should work:
DF = pd.DataFrame({'A': [0, 0, 1, 11, 1], 'B': [0, 1, 0, 1, 1], 'C': [0, 0, 1, 0, 1]})
DF.loc[DF.groupby('A').apply(lambda df: df[df.B == 1].index[0] if len(df) > 1 else df.index[0])]
A B C
1 0 1 0
4 1 1 1
3 11 1 0
Notes:
- groupby divides DF into groups of rows with unique A values, i.e. groups with A = 0 (2 rows), A = 1 (2 rows) and A = 11 (1 row).
- apply then calls the function on each group and assembles the results.
- In the function (lambda) I look for the index of the row with B == 1 if there is more than one row in the group; otherwise I use the index of the only row.
- The result of apply is a list of index values representing the row with B == 1 when a group has more than one row, else the single row for the given A.
- Those index values are then used to access the corresponding rows with the loc operator.
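Since the question mentions drop_duplicates: assuming B only takes the values 0 and 1, sorting by B and keeping the last row per A gives the same rows (a sketch, not part of the original answers):

import pandas as pd

DF = pd.DataFrame({'A': [0, 0, 1, 11, 1], 'B': [0, 1, 0, 1, 1], 'C': [0, 0, 1, 0, 1]})
# Rows with B == 1 sort after rows with B == 0, so keep='last' retains them for each A.
result = DF.sort_values('B').drop_duplicates('A', keep='last').sort_index()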
I was able to weave my way around pandas to get the result I want.
It's not pretty, but it gets the job done:
# 'grouped' is assumed to be df.groupby('CARD_NO') (not shown in the original post).
res = pd.DataFrame(columns=('CARD_NO', 'STATUS'))
for i in grouped.groups:
    if len(grouped.groups[i]) > 1:
        card_no = i
        print(card_no)
        for a in grouped.groups[card_no]:
            status = df.iloc[a]['STATUS']
            print('iloc:' + str(a) + '\t' + 'status:' + str(status))
            if status == 1:
                print('yes')
                row = pd.DataFrame([dict(CARD_NO=card_no, STATUS=status)])
                res = pd.concat([res, row], ignore_index=True)
            else:
                print('no')
    else:
        # only 1 record found
        # could be a status of 0 or 1
        # add to dataframe
        print('UNIQUE RECORD')
        card_no = i
        print(card_no)
        status = df.iloc[grouped.groups[card_no][0]]['STATUS']
        print(grouped.groups[card_no][0])
        # print(status)
        print('iloc:' + str(grouped.groups[card_no][0]) + '\t' + 'status:' + str(status))
        row = pd.DataFrame([dict(CARD_NO=card_no, STATUS=status)])
        res = pd.concat([res, row], ignore_index=True)

print(res)
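For reference, assuming STATUS only takes the values 0 and 1 and one kept row per card is enough, a similar selection can be written without the explicit loops (a sketch, not the poster's solution):

# For each CARD_NO pick the row with the highest STATUS (1 wins over 0);
# single-row groups are kept as they are.
res = df.loc[df.groupby('CARD_NO')['STATUS'].idxmax(), ['CARD_NO', 'STATUS']]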