I am trying to repair a CSV file.
Some data rows need to be removed based on a couple of conditions.
Say you have the following dataframe:
  A    B    C
000    0    0
000    1    0
001    0    1
011    1    0
001    1    1
If two or more rows have column A in common, I want to keep the row that has column B set to 1.
The resulting dataframe should look like this:
  A    B    C
000    1    0
011    1    0
001    1    1
I've experimented with merges and drop_duplicates but cannot seem to get the result I need. It is not certain that the row with column B = 1 will be after a row with B = 0. The take_last argument of drop_duplicates seemed attractive but I don't think it applies here.
Any advice will be greatly appreciated. Thank you.
Not straightforward, but this should work:
import pandas as pd

DF = pd.DataFrame({'A': [0, 0, 1, 11, 1], 'B': [0, 1, 0, 1, 1], 'C': [0, 0, 1, 0, 1]})

# .loc replaces the removed .ix indexer
DF.loc[DF.groupby('A').apply(lambda df: df[df.B == 1].index[0] if len(df) > 1 else df.index[0])]
A B C
1 0 1 0
4 1 1 1
3 11 1 0
Notes:
groupby divides DF into groups of rows with unique A values, i.e. groups with A = 0 (2 rows), A = 1 (2 rows) and A = 11 (1 row)
apply then calls the function on each group and combines the results
In the lambda I look for the index of the row with B == 1 if there is more than one row in the group; otherwise I use the index of the group's only row
The result of apply is therefore a list of index labels: the row with B == 1 for multi-row groups, or the single row for the given A
Those index labels are then used to select the corresponding rows with the .loc indexer
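If every multi-row group is guaranteed to contain a row with B == 1 (as in the example), a shorter route, sketched here as an alternative rather than taken from the answer above, is to sort on B and keep the last row per A:
import pandas as pd

DF = pd.DataFrame({'A': [0, 0, 1, 11, 1], 'B': [0, 1, 0, 1, 1], 'C': [0, 0, 1, 0, 1]})

# mergesort is stable, so within each A the B == 1 rows end up last; keep that last row per A
res = DF.sort_values('B', kind='mergesort').drop_duplicates('A', keep='last').sort_index()
print(res)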
Was able to weave my way around pandas to get the result I want.
It's not pretty, but it gets the job done:
import pandas as pd

# assumed setup (not shown in the snippet): df holds CARD_NO and STATUS columns
grouped = df.groupby('CARD_NO')
res = pd.DataFrame(columns=('CARD_NO', 'STATUS'))
for i in grouped.groups:
    if len(grouped.groups[i]) > 1:
        card_no = i
        print(card_no)
        for a in grouped.groups[card_no]:
            status = df.iloc[a]['STATUS']
            print('iloc:' + str(a) + '\t' + 'status:' + str(status))
            if status == 1:
                print('yes')
                row = pd.DataFrame([dict(CARD_NO=card_no, STATUS=status)])
                res = pd.concat([res, row], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
            else:
                print('no')
    else:
        # only 1 record found
        # could be a status of 0 or 1
        # add to dataframe
        print('UNIQUE RECORD')
        card_no = i
        print(card_no)
        status = df.iloc[grouped.groups[card_no][0]]['STATUS']
        print(grouped.groups[card_no][0])
        # print(status)
        print('iloc:' + str(grouped.groups[card_no][0]) + '\t' + 'status:' + str(status))
        row = pd.DataFrame([dict(CARD_NO=card_no, STATUS=status)])
        res = pd.concat([res, row], ignore_index=True)
print(res)
I have a dataframe as a result of validation codes:
df = spark.createDataFrame(list(zip(
    ['c_1', 'c_1', 'c_1', 'c_2', 'c_3', 'c_1', 'c_2', 'c_2'],
    ['valid', 'valid', 'invalid', 'missing', 'invalid', 'valid', 'valid', 'missing'],
    ['missing', 'valid', 'invalid', 'invalid', 'valid', 'valid', 'missing', 'missing'],
    ['invalid', 'valid', 'valid', 'missing', 'missing', 'valid', 'invalid', 'missing'],
))).toDF('clinic_id', 'name', 'phone', 'city')
I counted the number of valid, invalid, and missing values with an aggregation grouped by clinic_id in PySpark:
from pyspark.sql.functions import col, sum, when

agg_table = (
    df
    .groupBy('clinic_id')
    .agg(
        # name
        sum(when(col('name') == 'valid',1).otherwise(0)).alias('validname')
        ,sum(when(col('name') == 'invalid',1).otherwise(0)).alias('invalidname')
        ,sum(when(col('name') == 'missing',1).otherwise(0)).alias('missingname')
        # phone
        ,sum(when(col('phone') == 'valid',1).otherwise(0)).alias('validphone')
        ,sum(when(col('phone') == 'invalid',1).otherwise(0)).alias('invalidphone')
        ,sum(when(col('phone') == 'missing',1).otherwise(0)).alias('missingphone')
        # city
        ,sum(when(col('city') == 'valid',1).otherwise(0)).alias('validcity')
        ,sum(when(col('city') == 'invalid',1).otherwise(0)).alias('invalidcity')
        ,sum(when(col('city') == 'missing',1).otherwise(0)).alias('missingcity')
    ))
display(agg_table)
output:
clinic_id validname invalidname missingname ... invalidcity missingcity
--------- --------- ----------- ----------- ... ----------- -----------
c_1 3 1 0 ... 1 0
c_2 1 0 2 ... 1 0
c_3 0 1 0 ... 0 1
The resulting aggregated table is fine, but it is not ideal for further analysis. I tried pivoting within PySpark to get something like the table below:
#note: counts below are just made up, not the actual count from above, but I hope you get what I mean.
clinic_id category name phone city
-------- ------- ---- ------- ----
c_1 valid 3 1 3
c_1 invalid 1 0 2
c_1 missing 0 2 3
c_2 valid 3 1 3
c_2 invalid 1 0 2
c_2 missing 0 2 3
c_3 valid 3 1 3
c_3 invalid 1 0 2
c_3 missing 0 2 3
I initially searched for pivot/unpivot, but I learned it is called unstack in PySpark, and I also came across mapping.
I tried the suggested approach in "How to unstack dataset (using pivot)?", but it shows me only one column and I cannot get the desired result when I apply it to my dataframe of 30 columns.
I also tried the following using the validated table/dataframe:
expression = ""
cnt = 0
for column in agg_table.columns:
    if column != 'clinc_id':
        cnt += 1
        expression += f"'{column}' , {column},"

exprs = f"stack({cnt}, {expression[:-1]}) as (Type,Value)"
unpivoted = agg_table.select('clinic_id', expr(exprs))
I get an error that just points to that line, possibly referring to a return value.
I also tried grouping the results by id and category, but that is where I am stuck. If I group by an aggregated variable, say the values of validname, the aggregation only counts the values in that column and does not apply to the other count columns. So I thought of inserting a column with .withColumn that assigns the three categories to each ID, so that the aggregated counts could be grouped by id and category as in the table above, but I have not found a solution for that either.
Also, maybe a sql approach will be easier?
I found the right search phrase: "column to row in pyspark"
One of the suggested answers that fits my dataframe is this function:
from pyspark.sql.functions import array, col, explode, lit, struct

def to_long(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])

# rename key/val to match the column names used below
long_df = to_long(df, ["clinic_id"]) \
    .withColumnRenamed('key', 'column_names') \
    .withColumnRenamed('val', 'Status')
This created a dataframe of three columns: clinic_id, column_names, Status (valid, invalid, missing).
Then I created my aggregated table grouped by clinic_id, status:
display(
    long_df.groupBy('clinic_id', 'Status')
    .agg(
        sum(when(col('column_names') == 'name',1).otherwise(0)).alias('name')
        ,sum(when(col('column_names') == 'phone',1).otherwise(0)).alias('phone')
        ,sum(when(col('column_names') == 'city',1).otherwise(0)).alias('city')
    )
)
I got my intended table.
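For what it's worth, newer Spark versions have this built in: on Spark 3.4+, DataFrame.unpivot (also exposed as melt) produces the same long format without the helper function. A sketch, assuming the original df with clinic_id, name, phone, city:
# unpivot(ids, values, variableColumnName, valueColumnName) -- available from Spark 3.4
long_df = df.unpivot('clinic_id', ['name', 'phone', 'city'], 'column_names', 'Status')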
Is there an efficient way to concatenate strings column-wise across multiple rows of a DataFrame, such that the result is a single row whose value for each column is the concatenation of the values of that column over all given rows?
Example
Combine the first four rows as explained above.
>>> df = pd.DataFrame([["this", "this"], ["is", "is"], ["a", "a"], ["test", "test"], ["ignore", "ignore"]])
>>> df
0 1
0 this this
1 is is
2 a a
3 test test
4 ignore ignore
Either of these results would be accepted:
0 1
0 this is a test this is a test
0
1 this is a test
2 this is a test
If you need to join all rows except the last, use DataFrame.iloc with DataFrame.agg:
s = df.iloc[:-1].agg(' '.join)
print (s)
0 this is a test
1 this is a test
dtype: object
For a one-row DataFrame, add Series.to_frame with transpose:
df = df.iloc[:-1].agg(' '.join).to_frame().T
print (df)
0 1
0 this is a test this is a test
For all rows:
s = df.agg(' '.join)
print (s)
0 this is a test ignore
1 this is a test ignore
dtype: object
df = df.agg(' '.join).to_frame().T
print (df)
0 1
0 this is a test ignore this is a test ignore
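If the rows to combine are specifically the first four (rather than "all but the last"), the same pattern works with an explicit slice; a small sketch:
s = df.iloc[:4].agg(' '.join)      # first four rows only
df_one_row = s.to_frame().T        # one-row DataFrame form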
I have two data frames. Based on a condition that I get from a list (whose length is 2 million), I find the rows that match that condition; for those rows, I then replace the values in columns x and y of the first data frame with the values of x and y from the second data frame. Here is my code, but it is very slow and makes my computer freeze. Any idea how I can do this more efficiently?
for ids in List_id:
    a = df1.index[(df1['id'] == ids) == True].values[0]
    b = df2.index[(df2['id'] == ids) == True].values[0]
    df1['x'][a] = df2['x'][b]
    df1['y'][a] = df2['y'][b]
thank you
--
Example:
List_id = [1, 11, 12, 13]

ids = 1
a = df1.index[(df1['id'] == 1) == True].values[0]
print(a)   # 234
b = df2.index[(df2['id'] == 1) == True].values[0]
print(b)   # 789
df1['x'][a] = 0
df2['x'][b] = 15
So at the end I want in my data frame 1:
df1['x'][a] = df2['x'][b]
Assuming you don't have repeated ids in either dataframe, you can try something like the below:
step-1: filter df2
step-2: join df1 with the filtered frame
step-3: replace the values in the joined df and drop the extra columns (see the sketch after the code below)
df2_filtered = df2[df2['id'].isin(List_id)]
join_df = df1.set_index('id').join(df2_filtered.set_index('id'), rsuffix='_ignore', how='left')
# other columns from df2 will be null; you can use that to find the rows that need to be updated
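Step 3 is not spelled out above, so here is a minimal sketch of one way to finish it, assuming df2 only contributes x and y (so the join produces x_ignore and y_ignore for the matched ids):
# where df2 had a matching id, x_ignore/y_ignore are non-null: take them, otherwise keep df1's values
join_df['x'] = join_df['x_ignore'].fillna(join_df['x'])
join_df['y'] = join_df['y_ignore'].fillna(join_df['y'])
df1 = join_df.drop(columns=['x_ignore', 'y_ignore']).reset_index()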
How do I compare the value of the first row in col b with the last row in col b within each group of col a, without using the groupby function? The groupby function is very slow on a large dataset.
a = [1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3]
b = [1,0,0,0,0,0,7,8,0,0,0,0,0,4,1,0,0,0,0,0,1]
Return two lists: one with the group names from col a where the last value is larger than or equal to the first value, and one where it is smaller:
larger_or_equal = [1,3]
smaller = [2]
All numpy
import numpy as np

a = np.array([1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3])
b = np.array([1,0,0,0,0,0,7,8,0,0,0,0,0,4,1,0,0,0,0,0,1])
w = np.where(a[1:] != a[:-1])[0] # find the edges
e = np.append(w, len(a) - 1) # define the end pos
s = np.append(0, w + 1) # define start pos
# slice end positions with a boolean array, then slice groups with the end positions.
# I could also have used start positions.
a[e[b[e] >= b[s]]]
a[e[b[e] < b[s]]]
[1 3]
[2]
Here is a solution without groupby. The idea is to shift column a to detect group changes:
df[df['a'].shift() != df['a']]
a b
0 1 1
7 2 8
14 3 1
df[df['a'].shift(-1) != df['a']]
a b
6 1 7
13 2 4
20 3 1
We will compare the column b in those two dataframes. We simply need to reset the index for the pandas comparison to work:
first = df[df['a'].shift() != df['a']].reset_index(drop=True)
last = df[df['a'].shift(-1) != df['a']].reset_index(drop=True)
first.loc[last['b'] >= first['b'], 'a'].values
array([1, 3])
Then do the same with < to get the other groups. Or do a set difference.
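For instance, using the first and last frames defined above:
# groups where the last value of b is smaller than the first
first.loc[last['b'] < first['b'], 'a'].values
# array([2])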
As I wrote in the comments, groupby(sort=False) might well be faster depending on your dataset.
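For comparison, the groupby version mentioned above would look roughly like this (a sketch, not benchmarked here):
g = df.groupby('a', sort=False)['b'].agg(['first', 'last'])
larger_or_equal = g.index[g['last'] >= g['first']].tolist()   # [1, 3]
smaller = g.index[g['last'] < g['first']].tolist()            # [2]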
I have a dataframe as such
ID NAME group_id
0 205292 A 183144058824253894513539088231878865676
1 475121 B 183144058824253894513539088231878865676
2 475129 C 183144058824253894513539088231878865676
I want to transform it such that row 0 is linked to the other rows in the following way
LinkedBy By_Id LinkedTo To_Id group_id
1 A 205292 B 475121 183144058824253894513539088231878865676
2 A 205292 C 475129 183144058824253894513539088231878865676
Basically, I am compressing the first dataframe by linking the row at index 0 against all the others, so that an n-row df gives me an (n-1)-row df. I can accomplish this without the group id (which is of type long and stays constant) with the following code:
pd.DataFrame({"LinkedBy": df['NAME'].iloc[0],"By_Id": df['ID'].iloc[0],"LinkedTo":df['NAME'].iloc[1:],"To_Id":df['ID'].iloc[1:]})
But I am facing problems when adding the group id. When I do the following
pd.DataFrame({"LinkedBy": df['NAME'].iloc[0],"By_Id": df['ID'].iloc[0],"LinkedTo":df['NAME'].iloc[1:],"To_Id":df['ID'].iloc[1:],"GroupId":df['group_id'].iloc[0]})
I get OverflowError: long too big to convert.
How do I add the group_id of type long to my new df?
Since your group_id in all rows appears to be the same, you could try this:
# df.iloc[[0]] keeps the first row as a one-row DataFrame (merge expects a frame, not a Series)
res = pd.merge(left=df.iloc[[0]], right=df.iloc[1:], how='right', on=['group_id'])
res.columns = ['By_Id', 'LinkedBy', 'group_id', 'To_Id', 'LinkedTo']
Note that this will only work when group_id can be used as your join key.
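As a side note on the OverflowError itself: the group_id shown does not fit into a 64-bit integer, so broadcasting it as a plain scalar makes pandas try (and fail) to build an int64 column. Keeping it as a string (or object) sidesteps that; a minimal sketch along the lines of the original attempt:
pd.DataFrame({
    "LinkedBy": df['NAME'].iloc[0],
    "By_Id": df['ID'].iloc[0],
    "LinkedTo": df['NAME'].iloc[1:],
    "To_Id": df['ID'].iloc[1:],
    # keep the oversized id as a string so pandas does not build an int64 column for it
    "GroupId": str(df['group_id'].iloc[0]),
})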
groupby everything and then apply a custom function
cond1 makes sure 'group_id' matches
cond2 makes sure 'NAME' does not match
subset df inside the apply function
rename and drop columns
more renaming, dropping and index resetting
def find_grp(x):
    cond1 = df.group_id == x.name[2]
    cond2 = df.NAME != x.name[1]
    temp = df[cond1 & cond2]
    rnm = dict(ID='To_ID', NAME='LinkedTo')
    return temp.drop('group_id', axis=1).rename(columns=rnm)

cols = ['ID', 'NAME', 'group_id']
df1 = df.groupby(cols).apply(find_grp)
df1.index = df1.index.droplevel(-1)
df1.rename_axis(['By_ID', 'LinkedBy', 'group_id']).reset_index()
OR
df1 = df.merge(df, on='group_id', suffixes=('_By', '_To'))
df1 = df1[df1.NAME_By != df1.NAME_To]
rnm = dict(ID_By='By_ID', ID_To='To_ID', NAME_To='LinkedTo', NAME_By='LinkedBy')
df1.rename(columns=rnm)
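If you also want the columns in the same order as the desired output above, a final selection can be appended (column names here follow the rename just above):
out = df1.rename(columns=rnm)[['LinkedBy', 'By_ID', 'LinkedTo', 'To_ID', 'group_id']]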