I need to extract the data from a soup item - BeautifulSoup

I need to get the data out of a soup item. I tried the following Python code but it did not work (I am new to Python):
import bs4
import requests
from bs4 import BeautifulSoup
r = requests.get('https://finviz.com/futures_performance.ashx')
soup = bs4.BeautifulSoup(r.text, "lxml")
soup.find_all('script')[14].text.strip()
I need to get the data in the following format from the 'var rows' variable inside that script: {"ticker", "label", "group", "perf"}.
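If the values really are embedded in that script as a var rows = [...] assignment, one way to pull them out is to cut the JSON literal out of the script text and parse it. A minimal sketch, assuming the script index (14) and the variable name from the question; the exact page layout may differ:
import json
import re
import bs4
import requests

r = requests.get('https://finviz.com/futures_performance.ashx')
soup = bs4.BeautifulSoup(r.text, "lxml")
script_text = soup.find_all('script')[14].text

# Assumes the script contains something like: var rows = [{"ticker": ...}, ...];
match = re.search(r'var rows\s*=\s*(\[.*?\])\s*;', script_text, re.DOTALL)
if match:
    rows = json.loads(match.group(1))
    print(rows[:3])   # first few records as Python dicts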

The data on this page is loaded through AJAX from a different URL (a JSON response). requests parses it for us via .json() and stores the result in the variable data; Python's built-in json module is only used here to pretty-print the values:
import requests
import json
from operator import itemgetter
url = 'https://finviz.com/api/futures_perf.ashx'
data = requests.get(url).json()
# Data is loaded in variable `data`. To print it, uncomment next line:
# print(json.dumps(data, indent=4))
print('{: ^15}{: ^15}{: ^15}{: >15}'.format('Ticker', 'Label', 'Group', 'Perf'))
print('-' * 15*4)
f = itemgetter('ticker', 'label', 'group', 'perf')
for ticker in data:
    print('{: ^15}{: ^15}{: ^15}{: >15}'.format(*f(ticker)))
Prints:
Ticker Label Group Perf
------------------------------------------------------------
KC Coffee SOFTS 3.09
6N NZD CURRENCIES 0.24
GC Gold METALS 0.13
HO Heating Oil ENERGY 0.13
QA Crude Oil Brent ENERGY 0.11
NQ Nasdaq 100 INDICES 0.1
ES S&P 500 INDICES 0.07
ZB 30 Year Bond BONDS 0.06
ER2 Russell 2000 INDICES 0.06
DY DAX INDICES 0.05
ZN 10 Year Note BONDS 0.04
YM DJIA INDICES 0.03
SI Silver METALS 0.03
DX USD CURRENCIES 0.03
CL Crude Oil WTI ENERGY 0.02
PL Platinum METALS 0.01
ZF 5 Year Note BONDS 0.0
ZT 2 Year Note BONDS 0.0
6A AUD CURRENCIES 0.0
LC Live Cattle MEATS 0.0
PA Palladium METALS -0.01
6E EUR CURRENCIES -0.01
FC Feeder Cattle MEATS -0.02
6S CHF CURRENCIES -0.03
ZL Soybean oil GRAINS -0.04
RB Gasoline RBOB ENERGY -0.04
6C CAD CURRENCIES -0.05
HG Copper METALS -0.06
6B GBP CURRENCIES -0.06
6J JPY CURRENCIES -0.08
EX Euro Stoxx 50 INDICES -0.17
VX VIX INDICES -0.19
RS Canola GRAINS -0.25
ZW Wheat GRAINS -0.3
ZM Soybean Meal GRAINS -0.39
LH Lean Hogs MEATS -0.43
NKD Nikkei 225 INDICES -0.44
ZO Oats GRAINS -0.44
NG Natural Gas ENERGY -0.46
ZR Rough Rice GRAINS -0.5
ZS Soybeans GRAINS -0.53
CT Cotton SOFTS -0.58
ZC Corn GRAINS -0.62
JO Orange Juice SOFTS -1.89
SB Sugar SOFTS -2.11
ZK Ethanol ENERGY -2.76
CC Cocoa SOFTS -3.08
LB Lumber SOFTS -3.59
The variable data is a list of items:
[
    {
        "ticker": "KC",
        "label": "Coffee",
        "group": "SOFTS",
        "perf": 3.09
    },
    {
        "ticker": "6N",
        "label": "NZD",
        "group": "CURRENCIES",
        "perf": 0.22
    },
    {
        "ticker": "GC",
        "label": "Gold",
        "group": "METALS",
        "perf": 0.13
    },
    ... and so on.
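If you prefer tabular output, the same list of dicts can be loaded straight into a pandas DataFrame. A minimal sketch, assuming the same API URL as above and that pandas is installed:
import pandas as pd
import requests

url = 'https://finviz.com/api/futures_perf.ashx'
data = requests.get(url).json()

# Each dict in the list becomes one row; columns are ticker, label, group, perf.
df = pd.DataFrame(data)
print(df.sort_values('perf', ascending=False).to_string(index=False))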


Problem making the strip() function work properly on the entire dataframe

The strip function is not working properly for one country in the data frame:
Country  Energy Supply  Energy Supply per Capita  % Renewable
0 Afghanistan 3.210000e+08 10.0 78.669280
1 Albania 1.020000e+08 35.0 100.000000
2 Algeria 1.959000e+09 51.0 0.551010
3 American Samoa NaN NaN 0.641026
4 Andorra 9.000000e+06 121.0 88.695650
5 Angola 6.420000e+08 27.0 70.909090
6 Anguilla 2.000000e+06 136.0 0.000000
7 Antigua and Barbuda 8.000000e+06 84.0 0.000000
8 Argentina 3.378000e+09 79.0 24.064520
9 Armenia 1.430000e+08 48.0 28.236060
10 Aruba 1.200000e+07 120.0 14.870690
11 Australia 5.386000e+09 231.0 11.810810
12 Austria 1.391000e+09 164.0 72.452820
13 Azerbaijan 5.670000e+08 60.0 6.384345
14 Bahamas 4.500000e+07 118.0 0.000000
Here is the code I apply to get rid of the outer spaces in each cell:
Energy = pd.read_excel("assets/Energy Indicators.xls", header=17, skipfooter=38)
Energy.pop('Unnamed: 0')
Energy.pop('Unnamed: 1')
Energy.columns.values[0:4] = ['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']
Energy = Energy.replace('...', np.nan)
Energy = Energy.replace('Iran ', 'Iran')
Energy['Country'] = Energy['Country'].str.strip()
Energy['Energy Supply'] = Energy['Energy Supply'].multiply(1000000)
Energy['Country'] = Energy['Country'].str.replace(r"\(.*\)", "", regex=True)
Energy['Country'] = Energy['Country'].str.replace(r'\d+', "", regex=True)
Energy['Country'] = Energy['Country'].replace(["Republic of Korea", "United States of America", "United Kingdom of Great Britain and Northern Ireland", "China, Hong Kong Special Administrative Region"], ["South Korea", "United States", "United Kingdom", "Hong Kong"])

(Energy == 'Iran ').sum()

Country                     1
Energy Supply               0
Energy Supply per Capita    0
% Renewable                 0
dtype: int64
However, 'Iran ' keeps its trailing space on the right side. I would like to get rid of it, either by applying a function to the whole dataframe or by directly removing the space from 'Iran '.
You might want to select the string columns and apply str.strip (do it after the regex replacements, since removing the "(...)" part is what leaves the trailing space behind):
cols = Energy.select_dtypes('object').columns
Energy[cols] = Energy[cols].apply(lambda s: s.str.strip())
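As a self-contained illustration of that pattern on a toy frame (the values here are made up for the example):
import pandas as pd

df = pd.DataFrame({
    'Country': ['Iran ', ' Albania', 'Algeria'],
    'Energy Supply': [1.0, 2.0, 3.0],
})

# Strip leading/trailing whitespace in every object (string) column,
# leaving the numeric columns untouched.
cols = df.select_dtypes('object').columns
df[cols] = df[cols].apply(lambda s: s.str.strip())

print((df == 'Iran ').sum())   # Country should now be 0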

Sklearn only predicts one class while dataset is fairly balanced (±80/20 split)

I am trying to come up with a way to check what the most influential factors are for a person not paying back a loan (defaulting). I have worked with the sklearn library quite intensively, but I feel like I am missing something quite trivial...
The dataframe looks like this:
0 7590-VHVEG Female Widowed Electronic check Outstanding loan 52000 20550 108 0.099 288.205374 31126.180361 0 No Employed No Dutch No 0
1 5575-GNVDE Male Married Bank transfer Other 42000 22370 48 0.083 549.272708 26365.089987 0 Yes Employed No Dutch No 0
2 3668-QPYBK Male Registered partnership Bank transfer Study 44000 24320 25 0.087 1067.134272 26678.356802 0 No Self-Employed No Dutch No 0
The distribution of the "DefaultInd" column (target variable) is this:
0 0.835408
1 0.164592
Name: DefaultInd, dtype: float64
I have label encoded the data to make it look like this:
CustomerID Gender MaritalStatus PaymentMethod SpendingTarget EstimatedIncome CreditAmount TermLoanMonths YearlyInterestRate MonthlyCharges TotalAmountPayments CurrentLoans SustainabilityIndicator EmploymentStatus ExistingCustomer Nationality BKR_Registration DefaultInd
0 7590-VHVEG 0 4 2 2 52000 20550 108 0.099 288.205374 31126.180361 0 0 0 0 5 0 0
1 5575-GNVDE 1 1 0 1 42000 22370 48 0.083 549.272708 26365.089987 0 1 0 0 5 0 0
2 3668-QPYBK 1 2 0 4 44000 24320 25 0.087 1067.134272 26678.356802 0 0 2 0 5 0
After that I removed NaNs and cleaned it up some more (removing capitalization, punctuation, etc.). Then I try to run this cell:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
y = df['DefaultInd']
X = df.drop(['CustomerID','DefaultInd'],axis=1)
X = X.astype(float)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.20,random_state=42)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(classification_report(y_test, y_pred))
Which results in this:
              precision    recall  f1-score   support

           0       0.83      1.00      0.91      1073
           1       0.00      0.00      0.00       213

    accuracy                           0.83      1286
   macro avg       0.42      0.50      0.45      1286
weighted avg       0.70      0.83      0.76      1286
As you can see, the "1" class never gets predicted, and I am wondering whether or not this behaviour is to be expected (I think it is not). I tried class_weight='balanced', but that resulted in an average f1-score of 0.59 (instead of 0.76).
I feel like I am missing something. Or is this kind of behaviour expected, and should I rebalance the dataset before fitting? I feel like the division is not that skewed (±80/20), so there should not be this big of a problem.
Any help would be more than appreciated :)
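For reference, a minimal sketch of what the class_weight='balanced' attempt could look like; the stratified split and the scaling step are assumptions added here (logistic regression usually benefits from scaled features), not something the question already tried:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

y = df['DefaultInd']
X = df.drop(['CustomerID', 'DefaultInd'], axis=1).astype(float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

# class_weight='balanced' reweights the minority class; StandardScaler keeps
# columns with large ranges (income, charges) from dominating the fit.
model = make_pipeline(StandardScaler(),
                      LogisticRegression(class_weight='balanced', max_iter=1000))
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))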

Data is highly skewed and the value range is too large

I'm trying to rescale and normalize my dataset. My data is highly skewed and the value range is very large, which is affecting my models' performance.
I've tried RobustScaler() and PowerTransformer(), yet there is no improvement.
Below you can see the boxplot and KDE plot, and also the skew() test of my data:
df_test.agg(['skew', 'kurtosis']).transpose()
The data is financial data, so it can take a large range of values (they are not really outliers).
Depending on your data, there are several ways to handle this. There is, however, a function that will help you handle skewed data by doing a preliminary transformation before your normalization step.
Go to this repo (https://github.com/datamadness/Automatic-skewness-transformation-for-Pandas-DataFrame) and download the files skew_autotransform.py and TEST_skew_autotransform.py. Put them in the same folder as your code and use skew_autotransform in the same way as in this example:
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from skew_autotransform import skew_autotransform
exampleDF = pd.DataFrame(load_boston()['data'], columns = load_boston()['feature_names'].tolist())
transformedDF = skew_autotransform(exampleDF.copy(deep=True), plot = True, exp = False, threshold = 0.5)
print('Original average skewness value was %2.2f' %(np.mean(abs(exampleDF.skew()))))
print('Average skewness after transformation is %2.2f' %(np.mean(abs(transformedDF.skew()))))
It will return several graphs and skewness measures for each variable, but most importantly a transformed dataframe with the skewed data handled:
Original data:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX \
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0
.. ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0
502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0
PTRATIO B LSTAT
0 15.3 396.90 4.98
1 17.8 396.90 9.14
2 17.8 392.83 4.03
3 18.7 394.63 2.94
4 18.7 396.90 5.33
.. ... ... ...
501 21.0 391.99 9.67
502 21.0 396.90 9.08
503 21.0 396.90 5.64
504 21.0 393.45 6.48
505 21.0 396.90 7.88
[506 rows x 13 columns]
and the transformed data:
CRIM ZN INDUS CHAS NOX RM AGE \
0 -6.843991 1.708418 2.31 -587728.314092 -0.834416 6.575 201.623543
1 -4.447833 -13.373080 7.07 -587728.314092 -1.092408 6.421 260.624267
2 -4.448936 -13.373080 7.07 -587728.314092 -1.092408 7.185 184.738608
3 -4.194470 -13.373080 2.18 -587728.314092 -1.140400 6.998 125.260171
4 -3.122838 -13.373080 2.18 -587728.314092 -1.140400 7.147 157.195622
.. ... ... ... ... ... ... ...
501 -3.255759 -13.373080 11.93 -587728.314092 -0.726384 6.593 218.025321
502 -3.708638 -13.373080 11.93 -587728.314092 -0.726384 6.120 250.894792
503 -3.297348 -13.373080 11.93 -587728.314092 -0.726384 6.976 315.757117
504 -2.513274 -13.373080 11.93 -587728.314092 -0.726384 6.794 307.850962
505 -3.643173 -13.373080 11.93 -587728.314092 -0.726384 6.030 269.101967
DIS RAD TAX PTRATIO B LSTAT
0 1.264870 0.000000 1.807258 32745.311816 9.053163e+08 1.938257
1 1.418585 0.660260 1.796577 63253.425063 9.053163e+08 2.876983
2 1.418585 0.660260 1.796577 63253.425063 8.717663e+08 1.640387
3 1.571460 1.017528 1.791645 78392.216639 8.864906e+08 1.222396
4 1.571460 1.017528 1.791645 78392.216639 9.053163e+08 2.036925
.. ... ... ... ... ... ...
501 0.846506 0.000000 1.803104 129845.602554 8.649562e+08 2.970889
502 0.776403 0.000000 1.803104 129845.602554 9.053163e+08 2.866089
503 0.728829 0.000000 1.803104 129845.602554 9.053163e+08 2.120221
504 0.814408 0.000000 1.803104 129845.602554 8.768178e+08 2.329393
505 0.855697 0.000000 1.803104 129845.602554 9.053163e+08 2.635552
[506 rows x 13 columns]
After having done this, normalize the data if you need to.
Update
Given the ranges of some of your data, you probably need to do this case by case and by trial and error. There are several normalizers you can use to test different approaches. I'll demonstrate a few of them on one example column:
exampleDF = pd.read_csv("test.csv", sep=",")
exampleDF = pd.DataFrame(exampleDF['LiabilitiesNoncurrent_total'])
exampleDF.describe()

LiabilitiesNoncurrent_total
count 6.000000e+02
mean 8.865754e+08
std 3.501445e+09
min -6.307000e+08
25% 6.179232e+05
50% 1.542650e+07
75% 3.036085e+08
max 5.231900e+10
Sigmoid
Define the following function
def sigmoid(x):
    e = np.exp(1)
    y = 1 / (1 + e**(-x))
    return y
and do
df = sigmoid(exampleDF.LiabilitiesNoncurrent_total)
df = pd.DataFrame(df)
'LiabilitiesNoncurrent_total' had 'positive' skewness of 8.85
The transformed one has a skewness of -2.81
Log+1 Normalization
Another approach is to use a logarithmic function and then to normalize.
def normalize(column):
    upper = column.max()
    lower = column.min()
    y = (column - lower) / (upper - lower)
    return y
df = np.log(exampleDF['LiabilitiesNoncurrent_total'] + 1)
df_normalized = normalize(df)
The skewness is reduced by approximately the same amount.
I would opt for this last option rather than a sigmoidal approach. I also suspect that you can apply this solution to all your features.
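If you do want to try that across all features, a minimal sketch of applying the log-then-normalize recipe column by column might look like this; note that shifting each column so its minimum is zero is an addition of mine, since plain log(x + 1) would produce NaNs for the large negative values this column contains:
import numpy as np
import pandas as pd

def log_normalize(column):
    # Shift so the minimum is 0, then log(1 + x), then min-max normalize.
    # Constant columns would need a guard against division by zero.
    shifted = column - column.min()
    logged = np.log1p(shifted)
    return (logged - logged.min()) / (logged.max() - logged.min())

exampleDF = pd.read_csv("test.csv", sep=",")
numeric_cols = exampleDF.select_dtypes(include="number").columns
exampleDF[numeric_cols] = exampleDF[numeric_cols].apply(log_normalize)

print(exampleDF[numeric_cols].skew())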

pandas time-weighted average groupby in panel data

Hi, I have a panel data set that looks like this:
stock  date   time   spread1  weight  spread2
VOD    01-01  9:05   0.01     0.03    ...
VOD    01-01  9.12   0.03     0.05    ...
VOD    01-01  10.04  0.02     0.30    ...
VOD    01-02  11.04  0.02     0.05
...    ...    ...    ...      ...
BAT    01-01  0.05   0.04     0.03
BAT    01-01  0.07   0.05     0.03
BAT    01-01  0.10   0.06     0.04
I want to calculate the weighted average of spread1 for each stock on each day. I can break the solution into several steps, i.e. apply groupby and agg to get the sum of spread1*weight for each stock on each day in one dataframe, then calculate the sum of weight for each stock on each day in a second dataframe, and finally merge the two and divide to get the weighted average of spread1 (see the sketch below).
My question is: is there any simpler way to calculate the weighted average of spread1 here? I also have spread2, spread3 and spread4, so I want to write as little code as possible. Thanks.
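For context, the multi-step version described above could look roughly like this (a sketch; df and the column names follow the question):
import pandas as pd

# Sum of spread1*weight and sum of weight per stock/date, then divide.
tmp = df.assign(sw=df['spread1'] * df['weight'])
num = tmp.groupby(['stock', 'date'])['sw'].sum()
den = tmp.groupby(['stock', 'date'])['weight'].sum()
weighted = (num / den).rename('spread1_wavg').reset_index()
print(weighted)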
IIUC, you need to transform the result back to the original DataFrame, but using .transform with output that depends on two columns is tricky. We write our own function, to which we pass the spread series s and the original DataFrame df so we can also use the weights:
import numpy as np
import pandas as pd

def weighted_avg(s, df):
    return np.average(s, weights=df.loc[df.index.isin(s.index), 'weight'])

df['spread1_avg'] = df.groupby(['stock', 'date']).spread1.transform(weighted_avg, df)
Output:
stock date time spread1 weight spread1_avg
0 VOD 01-01 9:05 0.01 0.03 0.020526
1 VOD 01-01 9.12 0.03 0.05 0.020526
2 VOD 01-01 10.04 0.02 0.30 0.020526
3 VOD 01-02 11.04 0.02 0.05 0.020000
4 BAT 01-01 0.05 0.04 0.03 0.051000
5 BAT 01-01 0.07 0.05 0.03 0.051000
6 BAT 01-01 0.10 0.06 0.04 0.051000
If needed for multiple columns:
gp = df.groupby(['stock', 'date'])
for col in [f'spread{i}' for i in range(1, 5)]:
    df[f'{col}_avg'] = gp[col].transform(weighted_avg, df)
Alternatively, if you don't need to transform back and want one value per stock and date:
def my_avg2(gp):
    avg = np.average(gp.filter(like='spread'), weights=gp.weight, axis=0)
    return pd.Series(avg, index=[col for col in gp.columns if col.startswith('spread')])
### Create some dummy data
df['spread2'] = df.spread1+1
df['spread3'] = df.spread1+12.1
df['spread4'] = df.spread1+1.13
df.groupby(['stock', 'date'])[['weight'] + [f'spread{i}' for i in range(1,5)]].apply(my_avg2)
# spread1 spread2 spread3 spread4
#stock date
#BAT 01-01 0.051000 1.051000 12.151000 1.181000
#VOD 01-01 0.020526 1.020526 12.120526 1.150526
# 01-02 0.020000 1.020000 12.120000 1.150000

How can I efficiently disaggregate data in a Dataframe (given a set of weights, mapping, etc.)?

I have a dataframe that holds data at a particular level of aggregation - let's call it regional.
I also have a dict that explains how these regions are formed. Something like this:
map = {'Alabama': 'region_1', 'Arizona': 'region_1', 'Arkansas': 'region_2' ... }
And a set of weights for each state within its region, stored as a series:
Alabama .25
Arizona .75
Arkansas .33
....
Is there an efficient way to apply this disaggregation map to get a new dataframe at a State level?
Aggregation is easy:
df_regional = df_states.groupby(map).sum()
But how can I do disaggregation?
Assuming two dataframes, df_states and df_regional, with the following
structure:
In [36]: df_states
Out[36]:
Weight Region
Alabama 0.25 region_1
Arizona 0.75 region_1
Arkansas 0.33 region_2
In [37]: df_regional
Out[37]:
Value
region_1 100
region_2 80
Does pandas.merge arrange the data in a way that seems useful?
In [39]: df = pandas.merge(df_states, df_regional, left_on='Region', right_index=True)
In [40]: df
Out[40]:
Weight Region Value
Alabama 0.25 region_1 100
Arizona 0.75 region_1 100
Arkansas 0.33 region_2 80
In [41]: df.Weight * df.Value
Out[41]:
Alabama 25.0
Arizona 75.0
Arkansas 26.4
In [238]: map = {'Alabama': 'region_1', 'Arizona': 'region_1', 'Arkansas': 'region_2'}
In [239]: weigths = pandas.Series([.25, .75, .33], index=['Alabama', 'Arizona', 'Arkansas'])
In [240]: df_states = pandas.DataFrame({'map': pandas.Series(map), 'weigths': weigths})
In [241]: df_states
Out[241]:
map weigths
Alabama region_1 0.25
Arizona region_1 0.75
Arkansas region_2 0.33
In [242]: df_regional = df_states.groupby('map').sum()
In [243]: df_regional
Out[243]:
weigths
map
region_1 1.00
region_2 0.33
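Putting the pieces together, a small disaggregation helper built from the merge shown above might look like this (the column names Weight, Region and Value follow the example frames):
import pandas as pd

def disaggregate(df_regional, df_states):
    # Broadcast each regional value to its member states, then scale by weight.
    merged = pd.merge(df_states, df_regional,
                      left_on='Region', right_index=True)
    return merged['Weight'] * merged['Value']

df_states = pd.DataFrame(
    {'Weight': [0.25, 0.75, 0.33],
     'Region': ['region_1', 'region_1', 'region_2']},
    index=['Alabama', 'Arizona', 'Arkansas'])
df_regional = pd.DataFrame({'Value': [100, 80]},
                           index=['region_1', 'region_2'])

print(disaggregate(df_regional, df_states))
# Alabama     25.0
# Arizona     75.0
# Arkansas    26.4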