How to convert rows in df to dictionary? - pandas

Suppose I have the following df:
| Column1 | Column2 | Column3 |
|--------:|--------:|--------:|
| 1       | 4       | 23.2    |
| 32      | 4.2     | 62.2    |
| 9       | 12      | 2.2     |
I want to be able to get a dictionary in the following format:
{
0: {'Column1':1, 'Column2':4, 'Column3':23.2},
1: {'Column1': 32, 'Column2':4.2, 'Column3':62.2},
2: {'Column1':9, 'Column2':12, 'Column3':2.2}
}
How can I achieve this?

final_dict = df.set_index(df.index).T.to_dict('dict')
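For what it's worth, the 'index' orient returns the same {row label: {column: value}} mapping directly, without the transpose (a minimal alternative, assuming the default RangeIndex from the example):
# Equivalent, more direct form
final_dict = df.to_dict('index')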


pandas outliers with and without calculations

I'm contemplating making decisions on outliers in a dataset with over 300 features. I'd like to analyse the frame without removing the data hastily. I have a frame:
| | A | B | C | D | E |
|---:|----:|----:|-----:|----:|----:|
| 0 | 100 | 99 | 1000 | 300 | 250 |
| 1 | 665 | 6 | 9 | 1 | 9 |
| 2 | 7 | 665 | 4 | 9 | 1 |
| 3 | 1 | 3 | 4 | 3 | 6 |
| 4 | 1 | 9 | 1 | 665 | 5 |
| 5 | 3 | 4 | 6 | 1 | 9 |
| 6 | 5 | 9 | 1 | 3 | 2 |
| 7 | 1 | 665 | 3 | 2 | 3 |
| 8 | 2 | 665 | 9 | 1 | 0 |
| 9 | 5 | 0 | 7 | 6 | 5 |
| 10 | 0 | 3 | 3 | 7 | 3 |
| 11 | 6 | 3 | 0 | 3 | 6 |
| 12 | 6 | 6 | 5 | 1 | 5 |
I have coded some introspection to be saved in another frame called _outliers:
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = (Q3 - Q1)
min_ = (Q1 - (1.5 * IQR))
max_ = (Q3 + (1.5 * IQR))
# Counts outliers in columns
_outliers = (df.le(min_) | df.ge(max_)).sum().to_frame(name="outliers")
# Gives percentage of data that outliers represent in the column
_outliers["percent"] = (_outliers['outliers'] / _outliers['outliers'].sum()) * 100
# Shows max value in the column
_outliers["max_val"] = df[_outliers.index].max()
# Shows min value in the column
_outliers["min_val"] = df[_outliers.index].min()
# Shows median value in the column
_outliers["median"] = df[_outliers.index].median()
# Shows mean value in the column
_outliers["mean"] = df[_outliers.index].mean()
That yields:
| | outliers | percent | max_val | min_val | median | mean |
|:---|-----------:|----------:|----------:|----------:|---------:|---------:|
| A | 2 | 22.2222 | 665 | 0 | 5 | 61.6923 |
| B | 3 | 33.3333 | 665 | 0 | 6 | 164.385 |
| C | 1 | 11.1111 | 1000 | 0 | 4 | 80.9231 |
| D | 2 | 22.2222 | 665 | 1 | 3 | 77.0769 |
| E | 1 | 11.1111 | 250 | 0 | 5 | 23.3846 |
I would like to quantify the impact of the outliers on each column by calculating the mean and the median without them, but I don't want to actually remove them to do this calculation. I suppose the best way is to negate the outlier filter with "~", but I get lost in the code... This should benefit a lot of people, since a search on removing outliers yields a lot of results. Setting aside why they sneaked into the data in the first place, I just don't think the removal decision should be made without considering the potential impact. Feel free to add other considerations (skewness, sigma, n, etc.).
As always, I'm grateful to this community!
EDIT: I added variance and its square root, the standard deviation, with and without outliers. In some fields you might want to keep outliers and go into ML directly. At least, by inspecting your data beforehand, you'll know how much they contribute to your results. Used with nlargest() on the outliers column, you get a quick view of which features contain the most. You could use this as a basis for filtering features by setting thresholds on variance or mean. Thanks to the contributors, I have a powerful analytics tool now. Hope it can be useful to others.
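A minimal sketch of how those extra columns could be computed, reusing df, min_, max_ and _outliers from the snippet above and masking (not dropping) the flagged values; the column names are illustrative:
# Illustrative only: variance / standard deviation with and without the flagged outliers
outlier_mask = df.le(min_) | df.ge(max_)
_outliers["var"] = df.var()
_outliers["std"] = df.std()
_outliers["var_no_outliers"] = df.mask(outlier_mask).var()  # masked values become NaN and are skipped
_outliers["std_no_outliers"] = df.mask(outlier_mask).std()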
Take advantage of the apply method of DataFrame.
Series generator
Just define the robust mean you want by creating a function that consumes a Series and returns a scalar, then apply it to your DataFrame.
For the IQR mean, here is a simple snippet:
import pandas as pd

def irq_agg(x, factor=1.5, aggregate=pd.Series.mean):
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    return aggregate(x[(q1 - factor*(q3 - q1) < x) & (x < q3 + factor*(q3 - q1))])
data.apply(irq_agg)
# A 3.363636
# B 14.200000
# C 4.333333
# D 3.363636
# E 4.500000
# dtype: float64
The same can be done to filter out based on percentiles (both side version):
def quantile_agg(x, alpha=0.05, aggregate=pd.Series.mean):
    return aggregate(x[(x.quantile(alpha/2) < x) & (x < x.quantile(1 - alpha/2))])
data.apply(quantile_agg, alpha=0.01)
# A 12.454545
# B 15.777778
# C 4.727273
# D 41.625000
# E 4.909091
# dtype: float64
Frame generator
Even better, create a function that returns a Series; apply will then build a DataFrame. That way we can compute a bunch of different means and medians at once and compare them. We can also reuse the Series generator methods defined above:
def analyze(x, alpha=0.05, factor=1.5):
    return pd.Series({
        "p_mean": quantile_agg(x, alpha=alpha),
        "p_median": quantile_agg(x, alpha=alpha, aggregate=pd.Series.median),
        "irq_mean": irq_agg(x, factor=factor),
        "irq_median": irq_agg(x, factor=factor, aggregate=pd.Series.median),
        "standard": x[((x - x.mean())/x.std()).abs() < 1].mean(),
        "mean": x.mean(),
        "median": x.median(),
    })
data.apply(analyze).T
# p_mean p_median irq_mean irq_median standard mean median
# A 12.454545 5.0 3.363636 3.0 11.416667 61.692308 5.0
# B 15.777778 6.0 14.200000 5.0 14.200000 164.384615 6.0
# C 4.727273 4.0 4.333333 4.0 4.333333 80.923077 4.0
# D 41.625000 4.5 3.363636 3.0 3.363636 77.076923 3.0
# E 4.909091 5.0 4.500000 5.0 4.500000 23.384615 5.0
Now you can filter out outliers in several ways and compute relevant aggregates, such as the mean or median, on what remains.
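For example, to quantify the impact the question asks about, you could compare the plain mean with one of the robust means from the table above (a small sketch; the impact_pct name is just illustrative):
summary = data.apply(analyze).T
# Percentage change of the mean once IQR outliers are excluded
impact_pct = (summary["mean"] - summary["irq_mean"]) / summary["mean"] * 100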
No comment on whether this is an appropriate method to filter out your outliers. The code below should do what you asked:
import numpy as np
import pandas as pd

q1, q3 = df.quantile([0.25, 0.75]).to_numpy()
delta = (q3 - q1) * 1.5
min_val, max_val = q1 - delta, q3 + delta
outliers = (df < min_val) | (max_val < df)
result = pd.concat(
    [
        pd.DataFrame(
            {
                "outliers": outliers.sum(),
                "percent": outliers.sum() / outliers.sum().sum() * 100,
                "max_val": max_val,
                "min_val": min_val,
            }
        ),
        df.agg(["median", "mean"]).T,
        df.mask(outliers, np.nan).agg(["median", "mean"]).T.add_suffix("_no_outliers"),
    ],
    axis=1,
)
Result:
outliers percent max_val min_val median mean median_no_outliers mean_no_outliers
A 2 15.384615 13.5 -6.5 5.0 61.692308 3.0 3.363636
B 3 23.076923 243.0 -141.0 6.0 164.384615 5.0 14.200000
C 1 7.692308 13.0 -3.0 4.0 80.923077 4.0 4.333333
D 2 15.384615 16.0 -8.0 3.0 77.076923 3.0 3.363636
E 1 7.692308 10.5 -1.5 5.0 23.384615 5.0 4.500000
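As the edit to the question mentions, nlargest() on the outliers column then gives a quick view of the most outlier-heavy features; a small usage sketch using the result frame built above:
# Features with the most flagged outliers
result.nlargest(3, "outliers")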

Pandas drop duplicate pair data in different columns

Below is my data table, from my code output:
| columnA|ColumnB|ColumnC|
| ------ | ----- | ------|
| 12 | 8 | 1.34 |
| 8 | 12 | 1.34 |
| 1 | 7 | 0.25 |
I want to dedupe and keep only:
| columnA|ColumnB|ColumnC|
| ------ | ----- | ------|
| 12 | 8 | 1.34 |
| 1 | 7 | 0.25 |
Usually when I need to drop duplicates, I use .drop_duplicates(subset=...). But this time I want to drop duplicate pairs, e.g. treat (columnA, columnB) == (columnB, columnA) as the same row. From some research I found that people use set((a,b) if a<=b else (b,a) for a,b in pairs) to remove duplicate list pairs, but I don't know how to apply this method to my pandas data frame. Please help, and thank you in advance!
Convert the relevant columns to frozenset (unlike set, a frozenset is hashable, so duplicated() can compare the unordered pairs):
out = df[~df[['columnA', 'ColumnB']].apply(frozenset, axis=1).duplicated()]
print(out)
# Output
columnA ColumnB ColumnC
0 12 8 1.34
2 1 7 0.25
Details:
>>> set([8, 12])
{8, 12}
>>> set([12, 8])
{8, 12}
You can combine a and b (columnA and ColumnB in the question) into an order-independent tuple and call drop_duplicates on the combined column:
t = df[["a", "b"]].apply(lambda row: tuple(set(row)), axis=1)
df.assign(t=t).drop_duplicates("t").drop(columns="t")
A possible solution is the following:
# pip install pandas
import pandas as pd
# create test dataframe
df = pd.DataFrame({"colA": [12,8,1],"colB": [8,12,1],"colC": [1.34,1.34,0.25]})
df
# Put each pair into canonical order (smaller value first), then drop exact duplicates
df.loc[df.colA > df.colB, df.columns] = df.loc[df.colA > df.colB, df.columns[[1,0,2]]].values
df.drop_duplicates()
Returns

How do you control float formatting when using DataFrame.to_markdown in pandas?

I'm trying to use DataFrame.to_markdown with a dataframe that contains float values that I'd like to have rounded off. Without to_markdown() I can just set pd.options.display.float_format and everything works fine, but to_markdown doesn't seem to be respecting that option.
Repro:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [42.42, 99.11234123412341234, -23]])
pd.options.display.float_format = '{:,.0f}'.format
print(df)
print()
print(df.to_markdown())
outputs:
0 1 2
0 1 2 3
1 42 99 -23
| | 0 | 1 | 2 |
|---:|------:|--------:|----:|
| 0 | 1 | 2 | 3 |
| 1 | 42.42 | 99.1123 | -23 |
(compare the 42.42 and 99.1123 in the to_markdown table to the 42 and 99 in the plain old df)
Is this a bug or am I missing something about how to use to_markdown?
It looks like pandas uses tabulate for this formatting. If it's installed, you can use something like:
df.to_markdown(floatfmt=".0f")
output:
| | 0 | 1 | 2 |
|---:|----:|----:|----:|
| 0 | 1 | 2 | 3 |
| 1 | 42 | 99 | -23 |
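If different columns need different precision, tabulate's floatfmt also accepts a sequence of format strings, one per printed column; since to_markdown forwards its keyword arguments to tabulate, something like the sketch below should work (the index may count as the first column, depending on how tabulate receives the frame):
# One format per printed column: index, then columns 0, 1 and 2
print(df.to_markdown(floatfmt=("g", ".0f", ".2f", ".0f")))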

Iterate through pandas data frame and replace some strings with numbers

I have a dataframe sample_df that looks like:
bar foo
0 rejected unidentified
1 clear caution
2 caution NaN
Note this is just a random made-up df; the real one has a lot of other columns, say with data types other than just text. bar and foo might also have lots of empty cells/values, which are NaNs.
The actual df looks like this (the above is just a sample):
| | Unnamed: 0 | user_id | result | face_comparison_result | created_at | facial_image_integrity_result | visual_authenticity_result | properties | attempt_id |
|-----:|-------------:|:---------------------------------|:---------|:-------------------------|:--------------------|:--------------------------------|:-----------------------------|:----------------|:---------------------------------|
| 0 | 58 | ecee468d4a124a8eafeec61271cd0da1 | clear | clear | 2017-06-20 17:50:43 | clear | clear | {} | 9e4277fc1ddf4a059da3dd2db35f6c76 |
| 1 | 76 | 1895d2b1782740bb8503b9bf3edf1ead | clear | clear | 2017-06-20 13:28:00 | clear | clear | {} | ab259d3cb33b4711b0a5174e4de1d72c |
| 2 | 217 | e71b27ea145249878b10f5b3f1fb4317 | clear | clear | 2017-06-18 21:18:31 | clear | clear | {} | 2b7f1c6f3fc5416286d9f1c97b15e8f9 |
| 3 | 221 | f512dc74bd1b4c109d9bd2981518a9f8 | clear | clear | 2017-06-18 22:17:29 | clear | clear | {} | ab5989375b514968b2ff2b21095ed1ef |
| 4 | 251 | 0685c7945d1349b7a954e1a0869bae4b | clear | clear | 2017-06-18 19:54:21 | caution | clear | {} | dd1b0b2dbe234f4cb747cc054de2fdd3 |
| 5 | 253 | 1a1a994f540147ab913fcd61b7a859d9 | clear | clear | 2017-06-18 20:05:05 | clear | clear | {} | 1475037353a848318a32324539a6947e |
| 6 | 334 | 26e89e4a60f1451285e70ca8dc5bc90e | clear | clear | 2017-06-17 20:21:54 | suspected | clear | {} | 244fa3e7cfdb48afb44844f064134fec |
| 7 | 340 | 41afdea02a9c42098a15d94a05e8452b | NaN | clear | 2017-06-17 20:42:53 | clear | clear | {} | b066a4043122437bafae3ddcf6c2ab07 |
| 8 | 424 | 6cf6eb05a3cc4aabb69c19956a055eb9 | rejected | NaN | 2017-06-16 20:00:26 |
I want to replace any strings I find with numbers, per the below mapping.
def no_strings(df):
    columns = list(df)
    for column in columns:
        df[column] = df[column].map(result_map)

# We will need a mapping of strings to numbers to be able to analyse later.
result_map = {'unidentified': 0, "clear": 1, 'suspected': 2, "caution": 3, 'rejected': 4}
So the output might look like:
bar foo
0 4 0
1 1 3
2 3 NaN
For some reason, when I run no_strings(sample_df) I get errors.
What am I doing wrong?
df['bar'] = df['bar'].map(result_map)
df['foo'] = df['foo'].map(result_map)
df
bar foo
0 4 0
1 1 3
2 3 2
However, if you wish to be on the safe side (in case a key is not in your result_map and you don't want to see a NaN), do this:
df['foo'] = df['foo'].map(lambda x: result_map.get(x, 'not found'))
df['bar'] = df['bar'].map(lambda x: result_map.get(x, 'not found'))
So the output for this df:
bar foo
0 rejected unidentified
1 clear caution
2 caution suspected
3 sdgdg 0000
will result in:
bar foo
0 4 0
1 1 3
2 3 2
3 not found not found
To apply this to several columns at once:
cols = ['foo', 'bar', 'other_columns']
for c in cols:
    df[c] = df[c].map(lambda x: result_map.get(x, 'not found'))
Let's try stack, map the dict, and then unstack:
df.stack().to_frame()[0].map(result_map).unstack()
bar foo
0 4 0
1 1 3
2 3 2
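As a further option, not taken from the answers above, DataFrame.replace accepts the same mapping and leaves anything it does not match (other columns, NaNs, unmapped strings) untouched; a minimal sketch:
# Map the known strings to numbers across the whole frame in one call
sample_df.replace(result_map)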

pandas cumcount in pyspark

I'm currently attempting to convert a script of mine from pandas to pyspark. I have a dataframe that contains data in the form of:
index | letter
------|-------
0 | a
1 | a
2 | b
3 | c
4 | a
5 | a
6 | b
I want to create the following dataframe in which the occurrence count for each instance of a letter is stored, for example the first time we see "a" its occurrence count is 0, second time 1, third time 2:
index | letter | occurrence
------|--------|-----------
0 | a | 0
1 | a | 1
2 | b | 0
3 | c | 0
4 | a | 2
5 | a | 3
6 | b | 1
I can achieve this in pandas using:
df['occurrence'] = df.groupby('letter').cumcount()
How would I go about doing this in pyspark? I cannot find an existing method that is similar.
The feature you're looking for is called window functions. Note that row_number() starts at 1, so subtract 1 to match cumcount's zero-based count:
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

w = Window.partitionBy("letter").orderBy("index")
df.withColumn("occurrence", row_number().over(w) - 1)
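A quick usage note (assuming the frame really has the index column shown in the question): the window partitions by letter, so sort by index afterwards if you want to display the rows in their original order:
df.withColumn("occurrence", row_number().over(w) - 1).orderBy("index").show()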