The options.display.max_rows in pandas only shows a summary when truncation happens - pandas

I have a dataset with 200 rows. When setting
pd.options.display.max_rows = 200
we see all rows in a scrollable area:
But if we set it to less than the full dataset -thus requiring truncation - then we get a summary . and only 10 rows ?
pd.options.display.max_rows = 100
How can the options be set to really display 100 rows?

It appears that this behavior is intended to be controlled as much as it can be by display.large_repr:
display.large_repr : 'truncate'/'info'
For DataFrames exceeding max_rows/max_cols, the repr (and HTML repr) can
show a truncated table (the default from 0.13), or switch to the view from
df.info() (the behaviour in earlier versions of pandas).
[default: truncate] [currently: truncate]
The default truncate gives the present behavior. info does not even give any data at all:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Columns: 3 entries, column to column_val_cnt
dtypes: int64(1), object(2)
memory usage: 4.8+ KB
So there does not appear to be any way to achieve what I'm looking for.
Reference: Pandas Display Configurations
Btw I also looked into an indirect way to get what I was looking for
from IPython.display import display, HTML, Markdown
This did not work either. I'm getting close to "calling it a day" on this feature as "not supported".

Related

Using a multiindex to print out one specific row

I am working with a csv file that shows port trade of different commodities. It has a multiindex with the indexes being 'PORT', 'I_COMMODITY', and 'I_COMMODITY_SDSEC' (In that order). I am being asked to print one specific row out of this dataframe.
The question is --> Print out the row that corresponds to port = 3002 and commodity code = 95 (toys and games).
How do I go about printing out this one specific row? I've tried using the .xs() and .loc[] methods but it keeps returning my key error as being '95'.
trade.xs('95', level='I_COMMODITY')
I_COMMODITY is an int64 and the other 2 are objects
I am a newbie coder and am taking my first coding course so any help would be much appreciated.
to return
PORT I_COMMODITY I_COMMODITY
3002 95 -

Python KeyError when using pandas

I'm following a tutorial on NLP but have encountered a key error error when trying to group my raw data into good and bad reviews. Here is the tutorial link: https://towardsdatascience.com/detecting-bad-customer-reviews-with-nlp-d8b36134dc7e
#reviews.csv
I am so angry about the service
Nothing was wrong, all good
The bedroom was dirty
The food was great
#nlp.py
import pandas as pd
#read data
reviews_df = pd.read_csv("reviews.csv")
# append the positive and negative text reviews
reviews_df["review"] = reviews_df["Negative_Review"] +
reviews_df["Positive_Review"]
reviews_df.columns
I'm seeing the following error:
File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Negative_Review'
Why is this happening?
You're getting this error because you did not understand how to structure your data.
When you do df['reviews']=df['Positive_reviews']+df['Negative_reviews'] you're actually summing the values of Positive reviews to Negative reviews(which does not exist currently) into the 'reviews' column (chich also does not exist).
Your csv is nothing more than a plaintext file with one text in each row. Also, since you're working with text, remember to enclose every string in quotation marks("), otherwise your commas will create fakecolumns.
With your approach, it seems that you'll still tag all your reviews manually (usually, if you're working with machine learning, you'll do this outside code and load it to your machine learning file).
In order for your code to work, you want to do the following:
import pandas as pd
df = pd.read_csv('TestFileFolder/57886076.csv', names=['text'])
## Fill with placeholder values
df['Positive_review']=0
df['Negative_review']=1
df.head()
Result:
text Positive_review Negative_review
0 I am so angry about the service 0 1
1 Nothing was wrong, all good 0 1
2 The bedroom was dirty 0 1
3 The food was great 0 1
However, I would recommend you to have a single column (is_review_positive) and have it to true or false. You can easily encode it later on.

How to stop Jupyter outputting truncated results when using pd.Series.value_counts()?

I have a DataFrame and I want to display the frequencies for certain values in a certain Series using pd.Series.value_counts().
The problem is that I only see truncated results in the output. I'm coding in Jupyter Notebook.
I have tried unsuccessfully a couple of methods:
df = pd.DataFrame(...) # assume df is a DataFrame with many columns and rows
# 1st method
df.col1.value_counts()
# 2nd method
print(df.col1.value_counts())
# 3rd method
vals = df.col1.value_counts()
vals # neither print(vals) doesn't work
# All output something like this
value1 100000
value2 10000
...
value1000 1
Currently this is what I'm using, but it's quite cumbersome:
print(df.col1.value_counts()[:50])
print(df.col1.value_counts()[50:100])
print(df.col1.value_counts()[100:150])
# etc.
Also, I have read this related Stack Overflow question, but haven't found it helpful.
So how to stop outputting truncated results?
If you want to print all rows:
pd.options.display.max_rows = 1000
print(vals)
If you want to print all rows only once:
with pd.option_context("display.max_rows", 1000):
print(vals)
Relevant documentation here.
I think you need option_context and set to some large number, e.g. 999. Advatage of solution is:
option_context context manager has been exposed through the top-level API, allowing you to execute code with given option values. Option values are restored automatically when you exit the with block.
#temporaly display 999 rows
with pd.option_context('display.max_rows', 999):
print (df.col1.value_counts())

Is there a way to speed up this webscraping iteration? Pandas

So I'm collecting data on a list of stocks and putting all that info into a dataframe. The list has about 700 stocks.
import pandas as pd
stock =['adma','aapl','fb'] # list has about 700 stocks which I extracted from a pickled dataframe that was storing the info.
#The site I'm visiting is below with the name of the stock added to the end of the end of the link
##http://finviz.com/quote.ashx?t=adma
##http://finviz.com/quote.ashx?t=aapl
I'm just extracting one portion of that site, evident by [-2] in the code below
df2 = pd.DataFrame()
for i in stock:
df = pd.read_html('http://finviz.com/quote.ashx?t={}'.format(i), header =0)[-2].set_index('SEC Form 4')
df['Stock'] = i.upper() # creating a column which has the name of the stock, so I can differentiate between stocks
df2 = df2.append(df)
It feels like I'm doing a few seconds per iteration and I have around 700 to go through at the moment. It's not terribly slow, but I was just curious if there is a more efficient method. Thanks.
Your current code is blocking, you don't proceed with retrieving the information from the next url until you are done with the current. Instead, you can switch to, for example, Scrapy which is based on twisted and working asynchronously processing multiple pages at the same time.

Importing timeseries datasets to MATLAB (all values are displayed as NaN)

I am stuck trying to run an economic model using MATLAB - at the data importing part. For most of my code I'm using a freeware toolbox called IRIS.
I have quarterly dataset with 14 variables and 160 datapoints. Essentially the dataset is a 15X161 matrix- including the dates(col1) and variable names(B1:O1).
The command used for uploading data on IRIS is
d = dbload('filename.csv')
but this isn't working. Although MATLAB is creating a 1X1 array called d and creating fields under it (one for each variable). All cells display NaN - not a number.
Why is this happening?
I checked the tutorials on the IRIS toolbox website and tried running and loading a sample dataset from there using this command, but it leads to the same problem. Everywhere I checked- including MATLAB help, this seems to be the correct command to use when using IRIS, but somehow it isn't working.
I also tried uploading the data directly using MATLAB functions and not IRIS. The command I'm using is:
d = dataset('XLSFile','filename.xls','ReadVarNames', true).
Although this is working, and I can see all the variable names, but MATLAB can't read the dates. I tried xlsread and importdata as well, but they don't read the variable names. Is there any way for me to upload the entire Excel sheet with the variable names and dates?
It would be best if I could get the IRIS command to work, since the rest of my code would be compatible with that.
The dataset looks somewhat like this..
HO_GDP HO_CPI HO_CPI HO_RS HO_ER HO_POIL....
4/1/1970 82.33 85.01 55.00 99.87 08.77
7/1/1970 54.22 8.98 25.22 95.11 91.77
10/1/1970 85.41 85.00 85.22 95.34 55.00
1/1/1971 85.99 899 8.89 85.1
You can use the TEXTSCAN function to read the CSV file in MATLAB:
%# some options
numCols = 15; %# number of columns
opts = {'Delimiter',',', 'MultipleDelimsAsOne',true, 'CollectOutput',true};
%# open file for reading
fid = fopen('filename.csv','rt');
%# read header line
headers = textscan(fid, repmat('%s',1,numCols), 1, opts{:});
%# read rest of data rows
%# 1st column as string, the other 14 as floating point
data = textscan(fid, ['%s' repmat('%f',1,numCols-1)], opts{:});
%# close file
fclose(fid);
%# collect data
headers = headers{1};
data = [datenum(data{1},'mm/dd/yyyy') data{2}];
The result for the above sample you posted (assuming values are comma-separated):
>> headers
headers =
'HO_GDP' 'HO_CPI' 'HO_CPI' 'HO_RS' 'HO_ER' 'HO_POIL'
>> data
data =
7.1962e+05 82.33 85.01 55 99.87 8.77
7.1971e+05 54.22 8.98 25.22 95.11 91.77
7.198e+05 85.41 85 85.22 95.34 55
7.1989e+05 85.99 899 8.89 85.1 0
Note how in the last line of the code we convert the date column to serial date number, so that we can store the entire data in one numeric matrix. You can always go back to string representation of dates using DATESTR function:
>> datestr(data(:,1))
ans =
01-Apr-1970
01-Jul-1970
01-Oct-1970
01-Jan-1971