Jupyter Notebook Truncates Python Output [duplicate] - pandas

This question already has answers here:
How can I display full (non-truncated) dataframe information in HTML when converting from Pandas dataframe to HTML?
(10 answers)
Closed 3 years ago.
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', -1)
data = pd.read_csv(...)
data.columns
Given the code above I am expecting to see a complete list of the 668 columns in this data set. Instead the output is truncated like this:
Index(['VIN_SIGNI_PATTRN_MASK', 'NCI_MAK_ABBR_CD', 'MDL_YR', 'VEH_TYP_CD',
'VEH_TYP_DESC', 'MAK_NM', 'MDL_DESC', 'TRIM_DESC', 'OPT1_TRIM_DESC',
'OPT2_TRIM_DESC',
...
'EPA_SMART_WAY_DESC', 'MA_COLL_SYMB', 'MA_COMP_SYMB', 'MA_BASE_SYMB',
'MA_VSR_SYMB', 'MA_PERFORMANCE_IND', 'MA_ROLL_IND', 'PROACTIVE_IND',
'MAK_CD', 'MDL_CD'],
dtype='object', length=668)
Why can't I see all 668 columns?

Because you are changing how pandas pretty-prints a DataFrame, not how it renders the Index object that data.columns returns.
For example, display.max_rows and display.max_columns set the maximum number of rows and columns displayed when a frame is pretty-printed; truncated output is replaced by an ellipsis.
https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html#frequently-used-options
Instead of fighting the repr, just do list(data.columns)
(Screenshots in the original answer show the truncated output without list() and the full column list with list().)
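The truncation of data.columns is actually governed by pandas' display.max_seq_items option, not display.max_columns. A minimal sketch with a synthetic 668-column frame standing in for the original CSV:

```python
import pandas as pd

# Synthetic stand-in for the 668-column dataset in the question
df = pd.DataFrame({f"COL_{i}": [0] for i in range(668)})

# The Index repr is truncated by display.max_seq_items (default 100),
# not by display.max_columns
pd.set_option('display.max_seq_items', None)
print(df.columns)        # now shows every column name

# Or bypass pandas' repr entirely
print(list(df.columns))
```

Either route works; list() is the simplest because it hands the names to Python's own list repr, which never truncates.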

Your solution works for me... (can scroll to last column)
import pandas as pd
import numpy as np
print(pd.__version__)
pd.set_option('display.max_columns', None)
df = pd.DataFrame(np.random.rand(10, 668))
df

Related

Dataframe conversion from pandas to polars -- difference in the final dimensions

I'm trying to convert a Pandas DataFrame to a Polars one.
I simply used the function result_polars = pl.from_pandas(result). The conversion proceeds without errors, but when I check the shapes of the two dataframes, the Polars one has about half the rows of the original Pandas DataFrame.
I believe that a length of 4172903059 is close to the maximum size that a Polars dataframe allows.
Does anyone have suggestions?
Here is a screenshot of the shapes of the two dataframes.
Here is a minimal working example:
import polars as pl
import pandas as pd
import numpy as np
df = pd.DataFrame(np.zeros((4292903069,1), dtype=np.uint8))
df_polars = pl.from_pandas(df)
With these dimensions the two dataframes have the same size. If instead I use the following:
import polars as pl
import pandas as pd
import numpy as np
df = pd.DataFrame(np.zeros((4392903069,1), dtype=np.uint8))
df_polars = pl.from_pandas(df)
The Polars dataframe has a much smaller dimension (97935773).
The default polars wheel retrieved with pip install polars "only" allows for 2^32, i.e. ~4.2 billion rows.
If you need more than that, pip install polars-u64-idx and uninstall the previous installation.
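The shrunken shape in the question is consistent with the row count wrapping around at 2^32. A quick sanity check in plain Python (no polars needed; the wrap-around interpretation is an inference from the 32-bit index in the default wheel):

```python
# Default polars wheels use a 32-bit row index, so lengths past 2**32 wrap around
requested = 4392903069   # rows asked for in the second example above
limit = 2 ** 32          # 4294967296, the default index limit

print(requested % limit)  # 97935773, exactly the dimension observed
```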

Replacing whole string with part of it using Regex in Python Pandas

I have a table in pdf found on this link: https://taxation-customs.ec.europa.eu/system/files/2022-11/tobacco_products_releases-consumption.pdf
I am trying to clean the data before doing analysis, but I noticed that between 2014-2017 the cigarette data was merged due to an error. Instead of two cells per year in a column for Sweden and the UK, I got one merged cell that looks something like this: 5393688\r28587000
I would like to update the data only for Sweden and keep the first value, before the \r.
So far my code was as follows:
import numpy as np
import pandas as pd
import tabula  # needed for tabula.read_pdf below
from pandas import Series, DataFrame
cig = pd.DataFrame(tabula.read_pdf(r"https://taxation-customs.ec.europa.eu/system/files/2022-11/tobacco_products_releases-consumption.pdf", pages='all')[0])
cig.replace(to_replace='N/A', value=0, inplace=True, regex=True)
cig= cig.replace(',','', regex=True)
After this I tried
df.iloc[26,:].str.replace("('\r').*","")
cig.iloc[26,:] = cig.iloc[26,:].replace("('\r').*","", regex=True)
and
cig.iloc[26,:].replace(to_replace='(?:[0-9]+)([^0-9]{2})([0-9]+)', value='', regex=True)
But none of the above seems to produce the desired result, and I still have values in the same format, i.e. 5393688\r28587000
Set regex=True and assign the changed subset back to the DataFrame:
df.iloc[26,:] = df.iloc[26,:].replace("('\r').*","", regex=True)
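As a sketch of why that works, here is the same replace on a toy Series with merged values (the sample strings are made up to mimic the PDF extraction):

```python
import pandas as pd

# Made-up stand-ins for the merged cells from the PDF table
s = pd.Series(["5393688\r28587000", "123\r456"])

# With regex=True, everything from the carriage return onward is dropped,
# leaving only the first (Sweden) value
cleaned = s.replace(r"\r.*", "", regex=True)
print(cleaned.tolist())  # ['5393688', '123']
```

The key points are the same as in the answer: regex=True so the pattern is treated as a regular expression, and assigning the result back, since replace returns a new object rather than modifying in place.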

how to use pandas.concat instead of append [duplicate]

This question already has answers here:
How to replace pandas append with concat?
(3 answers)
Closed 4 months ago.
I have to import my Excel files and combine them into one file.
I used the code below and it worked, but I got the warning "The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead".
I tried to use concat but it doesn't work. Please help.
import numpy as np
import pandas as pd
import glob
all_data = pd.DataFrame()
for f in glob.glob(r'path\*.xlsx'):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
Use a list comprehension instead of a loop with DataFrame.append:
all_data = pd.concat([pd.read_excel(f) for f in glob.glob(r'path\*.xlsx')],
ignore_index=True)
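A minimal sketch with in-memory frames standing in for the Excel files (the file reading is elided, so the frames here are made up):

```python
import pandas as pd

# Stand-ins for the frames that pd.read_excel would return per file
frames = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3]})]

# One concat over the whole list replaces the deprecated append-in-a-loop
all_data = pd.concat(frames, ignore_index=True)
print(all_data)
```

Besides avoiding the deprecation warning, this is also faster: append copied the accumulated frame on every iteration, while concat allocates the combined frame once.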

Importing Data with read_csv into DF [duplicate]

This question already has answers here:
How to read a file with a semi colon separator in pandas
(2 answers)
Closed 1 year ago.
I have tried to import a CSV via pandas, but df.head() shows the data in the wrong rows (see picture).
import numpy as np
import pandas as pd
df = pd.read_csv(r"C:\Users\micha\OneDrive\Dokumenty\ML\winequality-red.csv")
df.head()
Can you help me?
It seems your data is not comma-separated but semicolon-separated. Try adding the sep parameter:
df = pd.read_csv(r"C:\Users\micha\OneDrive\Dokumenty\ML\winequality-red.csv", sep=';')
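A quick sketch of the difference, using an in-memory stand-in for the semicolon-separated winequality file (the column names here are abbreviated, not the real file's full header):

```python
import io
import pandas as pd

# Small made-up sample in the same semicolon-separated layout
csv_text = "fixed acidity;pH;quality\n7.4;3.51;5\n7.8;3.20;5\n"

# Without sep=';', each line would parse as a single column;
# with it, the three columns separate correctly
df = pd.read_csv(io.StringIO(csv_text), sep=';')
print(df.columns.tolist())  # ['fixed acidity', 'pH', 'quality']
```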

What is the difference between doing a regression with a dataframe and ndarray?

I would like to know why I would need to convert my dataframe to an ndarray when doing a regression, since I get the same result for the intercept and coefficients when I do not convert it.
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
from sklearn import linear_model
%matplotlib inline
# import data and create dataframe
!wget -O FuelConsumption.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/FuelConsumptionCo2.csv
df = pd.read_csv("FuelConsumption.csv")
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
# Split train/ test data
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]
# Modeling
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
# NOTE: if I use the dataframe train[['ENGINESIZE']] for 'x' and train[['CO2EMISSIONS']] for 'y'
# below, I get the same result
regr.fit (train_x, train_y)
# The coefficients
print ('Coefficients: ', regr.coef_)
print ('Intercept: ',regr.intercept_)
Thank you very much!
So df is the loaded dataframe, cdf is another frame with selected columns, and train is selected rows.
train[['ENGINESIZE']] is a 1 column dataframe (I believe train['ENGINESIZE'] would be a pandas Series).
I believe the preferred syntax for getting an array from the dataframe is:
train[['ENGINESIZE']].values # or
train[['ENGINESIZE']].to_numpy()
though
np.asanyarray(train[['ENGINESIZE']])
is supposed to do the same thing.
Digging down through the regr.fit code, I see that it calls sklearn.utils.check_X_y, which in turn calls sklearn.utils.check_array. That takes care of converting the inputs to numpy arrays, with some awareness of pandas dataframe peculiarities (such as mixed dtypes).
So it appears that if fit accepts your dataframes, you don't need to convert them ahead of time. But if you can get a clean array from the dataframe, there's no harm in doing that either. Either way, the fit is done with arrays derived from the dataframe.
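As a sketch of the equivalence, the three conversions below yield the same ndarray, so a least-squares fit gives the same answer whichever form you pass (the data is made up, and numpy's lstsq stands in for sklearn's LinearRegression to keep the example self-contained):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the train frame in the question
train = pd.DataFrame({"ENGINESIZE": [2.0, 3.5, 4.7],
                      "CO2EMISSIONS": [196.0, 221.0, 255.0]})

a = train[["ENGINESIZE"]].values         # preferred
b = train[["ENGINESIZE"]].to_numpy()     # also preferred
c = np.asanyarray(train[["ENGINESIZE"]])

# All three produce the same (n, 1) ndarray
assert a.shape == (3, 1)
assert np.array_equal(a, b) and np.array_equal(b, c)

# So a least-squares fit is identical whichever you pass
X = np.hstack([np.ones((3, 1)), c])      # prepend an intercept column
coef, *_ = np.linalg.lstsq(X, train["CO2EMISSIONS"].to_numpy(), rcond=None)
print(coef)  # [intercept, slope]
```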