Some confusion in creating pivot table - pandas

I am trying to create a pivot table, but I am not getting the result I want and I can't understand why this is happening.
I have a dataframe like this -
   data_channel_is_lifestyle  data_channel_is_bus  shares
0                        0.0                  0.0     593
1                        0.0                  1.0     711
2                        0.0                  1.0    1500
3                        0.0                  0.0    1200
4                        0.0                  0.0     505
The result I am looking for has the column names in the index and the sum of shares in the column. So I did this -
news_copy.pivot_table(index=['data_channel_is_lifestyle','data_channel_is_bus'], values='shares', aggfunc=sum)
but I am getting a result like this -
                                                 shares
data_channel_is_lifestyle data_channel_is_bus
0.0                       0.0                 107709305
                          1.0                  19168370
1.0                       0.0                   7728777
I don't want these 0s and 1s; I just want the result to be something like this -
                              shares
data_channel_is_lifestyle  107709305
data_channel_is_bus         19168370
How can I do this?

As you put it, it's just matrix multiplication:
df.filter(like='data').T @ df[['shares']]
Output (for sample data):
                           shares
data_channel_is_lifestyle     0.0
data_channel_is_bus        2211.0
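For reference, here is a minimal, self-contained sketch of that matrix-multiplication trick on the sample data from the question (frame and column names taken from the question):

```python
import pandas as pd

news_copy = pd.DataFrame({
    'data_channel_is_lifestyle': [0.0, 0.0, 0.0, 0.0, 0.0],
    'data_channel_is_bus':       [0.0, 1.0, 1.0, 0.0, 0.0],
    'shares':                    [593, 711, 1500, 1200, 505],
})

# Each 0/1 indicator column dotted with the shares column yields
# the sum of shares over the rows where the indicator is 1.
result = news_copy.filter(like='data').T @ news_copy[['shares']]
print(result)
# result.loc['data_channel_is_bus', 'shares'] is 711 + 1500 = 2211.0
```

The transpose turns the two indicator columns into rows, so the product is a frame indexed by the channel names, exactly the shape the asker wanted.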

How to open an Excel file with several sheets as one single dataframe

I have an Excel file with 10 sheets; each sheet is named after its respondent, like sheet1-Andres, sheet2-Paul and so on.
The column headers and the questions are the same on every sheet. The only difference is how each respondent fills Answer A to Answer Z with zeros and ones:
Question    Answer A  Answer B  Answer C
question 1         1         0         0
question 2         0         1         1
question 3         0         0         1
Since the Excel file has 10 sheets, I am looking for a way to open this xlsx file with pandas as one single dataframe, so that I can then make some calculations, like:
how many `1`s and `0`s Answer A got considering all 10 respondents.
What's the best approach here?
You can pass None to the sheet_name argument and then reset the index after a concat:
df = pd.concat(
    pd.read_excel('your_file', sheet_name=None)
).reset_index(0).rename(columns={'level_0': 'names'})
df['names'] = df['names'].str.replace(r"sheet\d+-", "", regex=True).str.strip()
print(df)
     names    Question  Answer 1  Answer 2  Answer 3
0  Andreas  question-1       1.0       0.0       1.0
1  Andreas  question-2       0.0       1.0       0.0
2  Andreas  question-3       1.0       1.0       1.0
0     Paul  question-1       1.0       0.0       1.0
1     Paul  question-2       0.0       1.0       0.0
2     Paul  question-3       1.0       1.0       1.0
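From there, the original question (how many 1s and 0s each answer column got across all respondents) is a couple of column reductions. A minimal sketch, with a small stand-in frame in place of the real concatenated survey data:

```python
import pandas as pd

# Stand-in for the concatenated frame built from the 10 sheets
# (names and values here are illustrative, not the real survey data).
df = pd.DataFrame({
    'names':    ['Andreas', 'Andreas', 'Paul', 'Paul'],
    'Question': ['question-1', 'question-2', 'question-1', 'question-2'],
    'Answer A': [1, 0, 1, 0],
    'Answer B': [0, 1, 0, 1],
})

answers = df.filter(like='Answer')
ones = answers.sum()            # number of 1s per answer column
zeros = (answers == 0).sum()    # number of 0s per answer column
print(ones['Answer A'], zeros['Answer A'])  # 2 2
```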

Data is highly skewed and the value range is too large

I'm trying to rescale and normalize my dataset.
My data is highly skewed and the range of values is very large, which is hurting my models' performance.
I've tried using RobustScaler() and PowerTransformer(), yet there is no improvement.
Below you can see the boxplot, the KDE plot and the skew() test of my data:
df_test.agg(['skew', 'kurtosis']).transpose()
The data is financial data, so it can take a large range of values (they are not really outliers).
Depending on your data, there are several ways to handle this. There is, however, a function that will help you handle skewed data by applying a preliminary transformation before your normalization effort.
Go to this repo (https://github.com/datamadness/Automatic-skewness-transformation-for-Pandas-DataFrame) and download the files skew_autotransform.py and TEST_skew_autotransform.py. Put skew_autotransform.py in the same folder as your code and use it as in this example:
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2
from skew_autotransform import skew_autotransform

exampleDF = pd.DataFrame(load_boston()['data'],
                         columns=load_boston()['feature_names'].tolist())
transformedDF = skew_autotransform(exampleDF.copy(deep=True), plot=True,
                                   exp=False, threshold=0.5)

print('Original average skewness value was %2.2f' % np.mean(abs(exampleDF.skew())))
print('Average skewness after transformation is %2.2f' % np.mean(abs(transformedDF.skew())))
It will return several graphs and measures of skewness of each variable, but most importantly a transformed dataframe of the skewed data:
Original data:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX \
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0
.. ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0
502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0
PTRATIO B LSTAT
0 15.3 396.90 4.98
1 17.8 396.90 9.14
2 17.8 392.83 4.03
3 18.7 394.63 2.94
4 18.7 396.90 5.33
.. ... ... ...
501 21.0 391.99 9.67
502 21.0 396.90 9.08
503 21.0 396.90 5.64
504 21.0 393.45 6.48
505 21.0 396.90 7.88
[506 rows x 13 columns]
and the transformed data:
CRIM ZN INDUS CHAS NOX RM AGE \
0 -6.843991 1.708418 2.31 -587728.314092 -0.834416 6.575 201.623543
1 -4.447833 -13.373080 7.07 -587728.314092 -1.092408 6.421 260.624267
2 -4.448936 -13.373080 7.07 -587728.314092 -1.092408 7.185 184.738608
3 -4.194470 -13.373080 2.18 -587728.314092 -1.140400 6.998 125.260171
4 -3.122838 -13.373080 2.18 -587728.314092 -1.140400 7.147 157.195622
.. ... ... ... ... ... ... ...
501 -3.255759 -13.373080 11.93 -587728.314092 -0.726384 6.593 218.025321
502 -3.708638 -13.373080 11.93 -587728.314092 -0.726384 6.120 250.894792
503 -3.297348 -13.373080 11.93 -587728.314092 -0.726384 6.976 315.757117
504 -2.513274 -13.373080 11.93 -587728.314092 -0.726384 6.794 307.850962
505 -3.643173 -13.373080 11.93 -587728.314092 -0.726384 6.030 269.101967
DIS RAD TAX PTRATIO B LSTAT
0 1.264870 0.000000 1.807258 32745.311816 9.053163e+08 1.938257
1 1.418585 0.660260 1.796577 63253.425063 9.053163e+08 2.876983
2 1.418585 0.660260 1.796577 63253.425063 8.717663e+08 1.640387
3 1.571460 1.017528 1.791645 78392.216639 8.864906e+08 1.222396
4 1.571460 1.017528 1.791645 78392.216639 9.053163e+08 2.036925
.. ... ... ... ... ... ...
501 0.846506 0.000000 1.803104 129845.602554 8.649562e+08 2.970889
502 0.776403 0.000000 1.803104 129845.602554 9.053163e+08 2.866089
503 0.728829 0.000000 1.803104 129845.602554 9.053163e+08 2.120221
504 0.814408 0.000000 1.803104 129845.602554 8.768178e+08 2.329393
505 0.855697 0.000000 1.803104 129845.602554 9.053163e+08 2.635552
[506 rows x 13 columns]
After having done this, normalize the data if you need to.
Update
Given the ranges of some of your data, you probably need to do this case by case, by trial and error. There are several normalizers you can use to test different approaches. I'll demonstrate a few of them on an example column:
exampleDF = pd.read_csv("test.csv", sep=",")
exampleDF = pd.DataFrame(exampleDF['LiabilitiesNoncurrent_total'])
       LiabilitiesNoncurrent_total
count                 6.000000e+02
mean                  8.865754e+08
std                   3.501445e+09
min                  -6.307000e+08
25%                   6.179232e+05
50%                   1.542650e+07
75%                   3.036085e+08
max                   5.231900e+10
Sigmoid
Define the following function
def sigmoid(x):
    e = np.exp(1)
    y = 1 / (1 + e ** (-x))
    return y
and do
df = sigmoid(exampleDF.LiabilitiesNoncurrent_total)
df = pd.DataFrame(df)
'LiabilitiesNoncurrent_total' had a 'positive' skewness of 8.85;
the transformed one has a skewness of -2.81.
Log+1 Normalization
Another approach is to use a logarithmic function and then to normalize.
def normalize(column):
    upper = column.max()
    lower = column.min()
    y = (column - lower) / (upper - lower)
    return y

# Note: np.log produces NaN for the negative values in this column;
# you may want to shift the data or handle negatives separately first.
df = np.log(exampleDF['LiabilitiesNoncurrent_total'] + 1)
df_normalized = normalize(df)
The skewness is reduced by approximately the same amount.
I would opt for this last option rather than the sigmoid approach. I also suspect that you can apply this solution to all your features.
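As a self-contained illustration of the log-then-normalize idea (on synthetic data, not the asker's; np.log1p is used instead of np.log(x + 1) as it is numerically safer near zero):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Heavily right-skewed synthetic values, similar in spirit to financial totals.
col = pd.Series(rng.lognormal(mean=10, sigma=2, size=1000))

transformed = np.log1p(col)  # compress the long right tail
# Min-max normalization to the 0..1 range (does not change skewness).
normalized = (transformed - transformed.min()) / (transformed.max() - transformed.min())

print(f"skew before: {col.skew():.2f}, after: {normalized.skew():.2f}")
```

The log of a lognormal sample is approximately normal, so the skewness drops to near zero while the values end up in a range any model can digest.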

Finding Similarities Between houses in pandas dataframe for content filtering

I want to apply content-based filtering for houses. I would like to find a similarity score for each house so that I can make recommendations. What can I recommend for house1? I need a similarity matrix for the houses. How can I find it?
Thank you
import pandas as pd

data = [['house1', 100, 1500, 'gas', '3+1'],
        ['house2', 120, 2000, 'gas', '2+1'],
        ['house3', 40, 1600, 'electricity', '1+1'],
        ['house4', 110, 1450, 'electricity', '2+1'],
        ['house5', 140, 1200, 'electricity', '2+1'],
        ['house6', 90, 1000, 'gas', '3+1'],
        ['house7', 110, 1475, 'gas', '3+1']]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['house', 'size', 'price',
                                 'heating_type', 'room_count'])
If we define similarity in terms of absolute difference for numeric values, and in terms of the similarity ratio calculated by SequenceMatcher for strings (or more precisely 1 - ratio, to make it comparable to the differences), we can apply these operations to the respective columns and then normalize the result to the range 0 ... 1, where 1 means (almost) equality and 0 means minimum similarity. Summing up the individual columns, we get the most similar house as the one with the maximum total similarity rating.
from difflib import SequenceMatcher

df = df.set_index('house')
res = df[['size', 'price']].sub(df.loc['house1', ['size', 'price']]).abs()
res['heating_type'] = df.heating_type.apply(
    lambda x: 1 - SequenceMatcher(None, df.heating_type.iloc[0], x).ratio())
res['room_count'] = df.room_count.apply(
    lambda x: 1 - SequenceMatcher(None, df.room_count.iloc[0], x).ratio())
res['total'] = res['size'] + res.price + res.heating_type + res.room_count
res = 1 - res / res.max()
print(res)
print('\nBest match of house1 is ' + res.total.iloc[1:].idxmax())
Result:
size price heating_type room_count total
house
house1 1.000000 1.00 1.0 1.0 1.000000
house2 0.666667 0.00 1.0 0.0 0.000000
house3 0.000000 0.80 0.0 0.0 0.689942
house4 0.833333 0.90 0.0 0.0 0.882127
house5 0.333333 0.40 0.0 0.0 0.344010
house6 0.833333 0.00 1.0 1.0 0.019859
house7 0.833333 0.95 1.0 1.0 0.932735
Best match of house1 is house7
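The answer above scores every house against house1 only; the full similarity matrix the question asks for can be sketched by repeating the same idea for every pair. This is an illustration, not part of the original answer: numeric columns are min-max scaled and compared by absolute difference, categorical columns by exact match, and the two parts are averaged (a shortened house list is used here for brevity):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['house1', 100, 1500, 'gas', '3+1'],
     ['house2', 120, 2000, 'gas', '2+1'],
     ['house3', 40, 1600, 'electricity', '1+1']],
    columns=['house', 'size', 'price', 'heating_type', 'room_count'],
).set_index('house')

num = df[['size', 'price']]
# Scale each numeric column to 0..1 so size and price contribute equally.
scaled = (num - num.min()) / (num.max() - num.min())

n = len(df)
sim = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # 1 - mean absolute difference of the scaled numeric features
        numeric_sim = 1 - (scaled.iloc[i] - scaled.iloc[j]).abs().mean()
        # fraction of categorical features that match exactly
        categorical_sim = np.mean([df.iloc[i][c] == df.iloc[j][c]
                                   for c in ['heating_type', 'room_count']])
        sim[i, j] = (numeric_sim + categorical_sim) / 2

sim_df = pd.DataFrame(sim, index=df.index, columns=df.index)
print(sim_df)
```

The matrix is symmetric with ones on the diagonal, and any row of it gives the ranking needed to recommend the nearest houses.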

pandas data frame iterating over 2 index variables

I have a data frame with two indexes, "DATE" (it is monthly data) and "ID", and a column named Volume. Now I want to iterate over it and, for every unique ID, fill a new column with the average value of the Volume column.
The basic idea is to figure out which months are above the yearly average for every ID.
list(df.index)
(Timestamp('1970-09-30 00:00:00'), 12167.0)
print(df.index.name)
None
I couldn't find a tutorial that addresses this :(
Can someone please point me in the right direction?
                    SHRCD  EXCHCD   SICCD     PRC     VOL       RET    SHROUT  \
DATE       PERMNO
1970-08-31 10559.0   10.0     1.0  5311.0  35.000  1692.0  0.030657   12048.0
           12626.0   10.0     1.0  5411.0  46.250   926.0  0.088235    6624.0
           12749.0   11.0     1.0  5331.0  45.500  5632.0  0.126173   34685.0
           13100.0   11.0     1.0  5311.0  22.000  1759.0  0.171242   15107.0
           13653.0   10.0     1.0  5311.0  13.125   141.0  0.220930    1337.0
           13936.0   11.0     1.0  2331.0  11.500   270.0 -0.053061    3942.0
           14322.0   11.0     1.0  5311.0  64.750  6934.0  0.024409  154187.0
           16969.0   10.0     1.0  5311.0  42.875  1069.0  0.186851   13828.0
           17072.0   10.0     1.0  5311.0  14.750   777.0  0.026087    5415.0
           17304.0   10.0     1.0  5311.0  24.875  1939.0  0.058511    8150.0
You can use transform with the year for a Series with the same size as the original DataFrame:
print(df)
VOL
DATE PERMNO
1970-08-31 10559.0 1
10559.0 2
12749.0 3
1971-08-31 13100.0 4
13100.0 5
df['avg'] = df.groupby([df.index.get_level_values(0).year, 'PERMNO'])['VOL'].transform('mean')
print(df)
VOL avg
DATE PERMNO
1970-08-31 10559.0 1 1.5
10559.0 2 1.5
12749.0 3 3.0
1971-08-31 13100.0 4 4.5
13100.0 5 4.5
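To get at the asker's actual goal (flagging months above the per-ID yearly average), the same transform can feed a simple comparison. A runnable sketch on the toy data above:

```python
import pandas as pd

df = pd.DataFrame(
    {'VOL': [1, 2, 3, 4, 5]},
    index=pd.MultiIndex.from_tuples(
        [(pd.Timestamp('1970-08-31'), 10559.0),
         (pd.Timestamp('1970-09-30'), 10559.0),
         (pd.Timestamp('1970-08-31'), 12749.0),
         (pd.Timestamp('1971-08-31'), 13100.0),
         (pd.Timestamp('1971-09-30'), 13100.0)],
        names=['DATE', 'PERMNO']),
)

# Yearly mean per (year, PERMNO), broadcast back to the original rows ...
df['avg'] = df.groupby([df.index.get_level_values('DATE').year,
                        'PERMNO'])['VOL'].transform('mean')
# ... so each month can be compared against its own yearly average.
df['above_avg'] = df['VOL'] > df['avg']
print(df)
```

No explicit iteration is needed; transform aligns the group means back to every row.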

adding a new column to data frame

I'm trying to do something that should be really simple in pandas, but it seems to be anything but. I have two large dataframes.
df1 has 243 columns, which include:
   ID2  K.  C   type
1  123  1.  2.  T
2  132  3.  1.  N
3  111  2.  1.  U
df2 has 121 columns, which include:
   ID3  A   B
1  123  0.  3.
2  111  2.  3.
3  132  1.  2.
df2 contains different information about the same IDs (ID2 = ID3), but in a different order.
I want to create a new column in df2 named type that matches the type column in df1: if an ID in df2 is the same as one in df1, it should copy that row's type (T, N or U) from df1. In other words, I need it to look like the following data frame, but with all 121 columns from df2 plus type:
ID3  A   B   type
123  0.  3.  T
111  2.  3.  U
132  1.  2.  N
I tried pd.merge and pd.join.
I also tried
df2['type'] = df1['ID2'].map(df2.set_index('ID3')['type'])
but none of them worked; the last one shows KeyError: 'ID3'.
As far as I can see, your last command is almost correct; the two frames are just swapped. Try this:
df2['type'] = df2['ID3'].map(df1.set_index('ID2')['type'])
join
df2.join(df1.set_index('ID2')['type'], on='ID3')
ID3 A B type
1 123 0.0 3.0 T
2 111 2.0 3.0 U
3 132 1.0 2.0 N
merge (take 1)
df2.merge(df1[['ID2', 'type']].rename(columns={'ID2': 'ID3'}))
ID3 A B type
0 123 0.0 3.0 T
1 111 2.0 3.0 U
2 132 1.0 2.0 N
merge (take 2)
df2.merge(df1[['ID2', 'type']], left_on='ID3', right_on='ID2').drop(columns='ID2')
ID3 A B type
0 123 0.0 3.0 T
1 111 2.0 3.0 U
2 132 1.0 2.0 N
map and assign
df2.assign(type=df2.ID3.map(dict(zip(df1.ID2, df1['type']))))
ID3 A B type
0 123 0.0 3.0 T
1 111 2.0 3.0 U
2 132 1.0 2.0 N
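All of the variants above give the same result. For completeness, a self-contained check of the map-based fix, using only the shared columns of the sample frames:

```python
import pandas as pd

df1 = pd.DataFrame({'ID2': [123, 132, 111], 'type': ['T', 'N', 'U']})
df2 = pd.DataFrame({'ID3': [123, 111, 132],
                    'A': [0., 2., 1.], 'B': [3., 3., 2.]})

# Look up each ID3 in df1's ID2 -> type mapping.
df2['type'] = df2['ID3'].map(df1.set_index('ID2')['type'])
print(df2['type'].tolist())  # ['T', 'U', 'N']
```

The set_index('ID2')['type'] expression builds a Series keyed by ID, which is exactly what Series.map expects as a lookup table.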