How to stop Jupyter outputting truncated results when using pd.Series.value_counts()? - pandas

I have a DataFrame and I want to display the frequencies for certain values in a certain Series using pd.Series.value_counts().
The problem is that I only see truncated results in the output. I'm coding in Jupyter Notebook.
I have unsuccessfully tried a couple of methods:
df = pd.DataFrame(...) # assume df is a DataFrame with many columns and rows
# 1st method
df.col1.value_counts()
# 2nd method
print(df.col1.value_counts())
# 3rd method
vals = df.col1.value_counts()
vals  # print(vals) doesn't work either
# All output something like this
value1 100000
value2 10000
...
value1000 1
Currently this is what I'm using, but it's quite cumbersome:
print(df.col1.value_counts()[:50])
print(df.col1.value_counts()[50:100])
print(df.col1.value_counts()[100:150])
# etc.
Also, I have read this related Stack Overflow question, but haven't found it helpful.
So how do I stop Jupyter from truncating the output?

If you want to print all rows:
pd.options.display.max_rows = 1000
print(vals)
If you want to print all rows only once:
with pd.option_context("display.max_rows", 1000):
    print(vals)
Relevant documentation here.
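If you don't know how many distinct values there are, display.max_rows also accepts None, which disables truncation entirely; a minimal sketch:
import pandas as pd

# None removes the row limit, so the whole Series is printed
with pd.option_context('display.max_rows', None):
    print(df.col1.value_counts())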

I think you need option_context set to some large number, e.g. 999. The advantage of this solution is:
option_context context manager has been exposed through the top-level API, allowing you to execute code with given option values. Option values are restored automatically when you exit the with block.
# temporarily display 999 rows
with pd.option_context('display.max_rows', 999):
    print(df.col1.value_counts())

Related

Query returns values that don't exist in PySpark DataFrame

Is there a way to create a subset dataframe from a dataframe and be sure that its values will be used afterward?
I have a huge PySpark Dataframe like this (simplified example):
id    timestamp     value
1     1658919602    5
1     1658919604    9
2     1658919632    2
Now I want to take a sample from it to test something, before running on the entire Dataframe. I get a sample by:
# Big dataframe
df = ...
# Create sample
df_sample = df.limit(10)
df_sample.show() shows some values.
Then I run this command, and sometimes it returns values that are present in df_sample and sometimes it returns values that are not present in df_sample but in df.
df_temp = df_sample.sort(F.desc('timestamp')).groupBy('id').agg(F.collect_list('value').alias('newcol'))
It's as if it's not using df_sample, but instead picking 10 rows from df non-deterministically.
Interestingly, if I run df_sample.show() afterwards, it shows the same values as when it was first called.
Why is this happening?
Here's full code:
# Big dataframe
df = ...
# Create sample
df_sample = df.limit(10)
# shows some values
df_sample.show()
# run query
df_temp = df_sample.sort(F.desc('timestamp')).groupBy('id').agg(F.collect_list('value').alias('newcol'))
# df_temp sometimes shows values that are present in df_sample, but sometimes shows values that aren't present in df_sample but in df
df_temp.show()
# Shows the exact same values as when it was first called
df_sample.show()
Edit1: I understand that Spark is lazy, but is there any way to force it to not be lazy in this scenario?
We can use the sample function provided by Spark to achieve this. Every time you run sample() it returns a different set of records. To regenerate the same sample on every run, e.g. so you can compare results with a previous run, pass the same seed value each time.
df=spark.range(100)
# Execute first time
print(df.sample(0.1,123).collect())
# Execute Second time with same seed-123
print(df.sample(0.1,123).collect())
# Execute with different seed-456
print(df.sample(0.1,456).collect())
Refer to the Spark docs.
Stratified sampling in Spark
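For the stratified case, a sketch using DataFrame.sampleBy, which takes a sampling fraction per key value plus a seed (the fractions below are made up for illustration):
# Sample ~10% of the rows for each id, reproducibly thanks to the seed
fractions = {1: 0.1, 2: 0.1}
df_stratified = df.sampleBy('id', fractions, seed=123)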
What worked was using df_sample = df.limit(10).cache() or df_sample = df.limit(10).persist(). Samkart's comment pointed me in this direction.
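For completeness, a minimal sketch of that fix applied to the code above (count() is just one convenient action to force materialization; any action would do):
# Caching pins down the 10 sampled rows, so later queries reuse them
df_sample = df.limit(10).cache()
df_sample.count()  # an action, to materialize the cache
df_temp = df_sample.sort(F.desc('timestamp')).groupBy('id').agg(F.collect_list('value').alias('newcol'))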

Appending data frames or expanding with for-loop: which is better?

I have the following 3 dataframes:
I want to append df_forecast to each of df2_CA and df2_USA using a for-loop. However, when I run my code, df_forecast is not appended: df2_CA and df2_USA appear exactly as shown above.
df_list = [df2_CA, df2_USA]
for x in df_list:
    pd.concat([x, df_forecast])
    x['prod'].fillna(method='ffill', inplace=True)
    x['country'].fillna(method='ffill', inplace=True)
    x.loc['2020-03':'2020-05', 'rev'] = 200
Desired result:
Should I append df_forecast to each of df2_CA and df2_USA, or should I use the for-loop to expand the date range without creating df_forecast? If so, how do you code it?
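A likely cause, sketched here as an assumption: pd.concat returns a new DataFrame rather than modifying x in place, so the concatenated result is discarded on each iteration. Storing it back makes the loop take effect:
import pandas as pd

df_list = [df2_CA, df2_USA]
for i, x in enumerate(df_list):
    combined = pd.concat([x, df_forecast])  # concat returns a new frame
    combined['prod'] = combined['prod'].ffill()
    combined['country'] = combined['country'].ffill()
    combined.loc['2020-03':'2020-05', 'rev'] = 200
    df_list[i] = combined  # store the result back; x is only a loop variable
df2_CA, df2_USA = df_list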

splitting columns with str.split() not changing the outcome

I have to use str.split() for an exercise. I have a column called title and it looks like this:
and I need to split it into two columns, Name and Season. The following code does not throw an error, but it doesn't seem to be doing anything either when I test it with df.head():
df[['Name', 'Season']] = df['title'].str.split(':',n=1, expand=True)
Any help as to why?
The code you have in your question is correct, and should be working. The issue could be coming from the execution order of your code though, if you're using Jupyter Notebook or some method that allows for unclear ordering of code execution.
I recommend starting a fresh kernel/terminal to clear all variables from the namespace, then executing those lines in order, e.g.:
# perform steps to load data in and clean
print(df.columns)
df[['Name', 'Season']] = df['title'].str.split(':',n=1, expand=True)
print(df.columns)
Alternatively, you could add an assertion step to your code to ensure it's working:
df[['Name', 'Season']] = df['title'].str.split(':',n=1, expand=True)
assert {'Name', 'Season'}.issubset(set(df.columns)), "Columns were not added"
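For reference, a quick self-contained check (the title value below is made up) showing what the split produces:
import pandas as pd

df = pd.DataFrame({'title': ['Stranger Things: Season 1']})  # hypothetical value
df[['Name', 'Season']] = df['title'].str.split(':', n=1, expand=True)
print(df[['Name', 'Season']])
# Name: 'Stranger Things', Season: ' Season 1' (note the leading space)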

How do I use values within a list to specify changing selection conditions and export paths?

I'm trying to split a large csv data using a condition. To automate this process, I'm pulling a list of unique conditions from a column in the data set and wanting to use this list within a loop to specify condition and also rename the export file.
I've converted the array of values into a list and have tried fitting my function into a loop; however, I believe the syntax is the main problem.
# df1718 is my df
# znlist is my list of values (e.g. 0 1 2 3 4)
# serial is specified at the top e.g. '4'
for x in znlist:
    dftemps = df1718[(df1718.varname == 'RoomTemperature') & (df1718.zone == x)]
    dftemps.to_csv('E:\\path\\test%d_zone(x).csv', serial)
So in theory, I would like each iteration to export the data relevant to the next zone in the list and the export file to be named test33_zone0.csv (for example). Thanks for any help!
EDIT:
The error I am getting is: "delimiter" must be string, not int
The second positional argument of DataFrame.to_csv is sep (the delimiter), which is why passing serial there raises that error. If the error is in saving the file, try this:
dftemps.to_csv('E:\\path\\test{}_zone{}.csv'.format(str(serial),str(x)))
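Putting that into the loop from the question, a sketch:
for x in znlist:
    dftemps = df1718[(df1718.varname == 'RoomTemperature') & (df1718.zone == x)]
    # format() builds e.g. 'test4_zone0.csv'; only the path string is passed
    # to to_csv, so nothing lands in the sep (delimiter) parameter
    dftemps.to_csv('E:\\path\\test{}_zone{}.csv'.format(serial, x))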

pseudo randomization in loop PsychoPy

I know other people have asked similar questions in past but I am still stuck on how to solve the problem and was hoping someone could offer some help. Using PsychoPy, I would like to present different images, specifically 16 emotional trials, 16 neutral trials and 16 face trials. I would like to pseudo randomize the loop such that there would not be more than 2 consecutive emotional trials. I created the experiment in Builder but compiled a script after reading through previous posts on pseudo randomization.
I have read the previous posts that suggest creating randomized excel files and using those, but considering how many trials I have, I think that would be too many and was hoping for some help with coding. I have tried to implement and tweak some of the code that has been posted for my experiment, but to no avail.
Does anyone have any advice for my situation?
Thank you,
Rae
Here's an approach that will always converge very quickly, given that you have 16 of each type and only reject runs of more than two emotion trials. #brittUWaterloo's suggestion to generate trials offline is very good; this is what I do myself, typically. (I like to have a small number of random orders, do them forward for some subjects and backwards for others, and prescreen them to make sure there are no weird or unintended juxtapositions.) But the algorithm below is certainly safe enough to do within an experiment if you prefer.
This first example assumes that you can represent a given trial using a string, such as 'e' for an emotion trial, 'n' for neutral, 'f' for face. This would work with 'emo', 'neut', 'face' as well, not just single letters; just change 'eee' to 'emoemoemo' in the code:
import random
trials = ['e'] * 16 + ['n'] * 16 + ['f'] * 16
while 'eee' in ''.join(trials):
    random.shuffle(trials)
print(trials)
Here's a more general way of doing it, where the trial codes are not restricted to be strings (although they are strings here for illustration):
import random

def run_of_3(trials, obj):
    # detect if there's a run of at least 3 objects 'obj'
    for i in range(2, len(trials)):
        if trials[i-2: i+1] == [obj] * 3:
            return True
    return False

tr = ['e'] * 16 + ['n'] * 16 + ['f'] * 16
while run_of_3(tr, 'e'):
    random.shuffle(tr)
print(tr)
Edit: To create a PsychoPy-style conditions file from the trial list, just write the values into a file like this:
with open('emo_neu_face.csv', 'w') as f:
    f.write('stim\n')  # this is a 'header' row
    f.write('\n'.join(tr))  # these are the values
Then you can use that as a conditions file in a Builder loop in the regular way. You could also open this in Excel, and so on.
This is not quite right, but hopefully will give you some ideas. I think you could occasionally get caught in an infinite cycle in the elif statement if the last three items ended up the same, but you could add some sort of counter there. In any case this shows a strategy you could adapt. Rather than put this in the experimental code, I would generate the trial sequence separately at the command line, and then save a successful output as a list in the experimental code to show to all participants, and know things wouldn't crash during an actual run.
import random as r

# making some dummy data
abc = ['f'] * 10 + ['e'] * 10 + ['d'] * 10

def f(l1, l2):
    # just looking at the output to see how it works; can delete
    print("l1 = " + str(l1))
    print(l2)
    if not l2:
        # checks if second list is empty; if so, we are done
        out = list(l1)
    elif l1[-1] == l1[-2] and l1[-1] == l2[0]:
        # shuffling changes the list in place, so copy it to use it
        r.shuffle(l2)
        t = list(l2)
        f(l1, t)
    else:
        print("i am here")
        l1.append(l2.pop(0))
        f(l1, l2)
    return l1
You would then run it with something like newlist = f(abc[0:2],abc[2:-1])
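Building on the counter idea above, a sketch (the max_tries parameter is a hypothetical addition) that gives up after a bounded number of reshuffles instead of recursing forever:
import random as r

def f(l1, l2, tries=0, max_tries=100):
    # bail out rather than recursing forever on an impossible tail
    if tries > max_tries:
        raise RuntimeError('no valid ordering found')
    if not l2:
        return list(l1)
    if l1[-1] == l1[-2] and l1[-1] == l2[0]:
        r.shuffle(l2)
        return f(l1, list(l2), tries + 1, max_tries)
    l1.append(l2.pop(0))
    return f(l1, l2, tries, max_tries)

abc = ['f'] * 10 + ['e'] * 10 + ['d'] * 10
newlist = f(abc[0:2], abc[2:])  # abc[2:] keeps the last trial; abc[2:-1] would drop it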