PySpark DataFrame: crosstab or other method to turn row labels into new columns

I have a PySpark DataFrame as shown in the picture:
That is, I have four columns: year, word, count, frequency, where year ranges from 2000 to 2015.
I would like to apply some operation to the (PySpark) DataFrame so that I get a result in the format shown in the second picture:
The new DataFrame columns should be: word, frequency_2000, frequency_2001, frequency_2002, ..., frequency_2015,
with the frequency of each word in each year taken from the original DataFrame.
Any advice on how I could write this efficiently?
Also, please rename the title if you can come up with something more informative.

After some research, I found a solution:

The crosstab function can produce the output directly:
topw_ys.crosstab("word", "year").toPandas()
Results:
  word_year  2000  2015
0    mining    10     6
1    system    11    12
...
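
Note that crosstab counts the (word, year) pairs rather than carrying over the frequency values themselves. If the goal is the actual frequency column pivoted into frequency_<year> columns, a groupBy/pivot sketch along these lines should work (assuming the DataFrame is still called topw_ys and the columns are named as in the question):

from pyspark.sql import functions as F

# Pivot the year values into columns, carrying the frequency of each word/year pair.
# first() is used as the aggregate on the assumption that there is exactly one
# frequency value per (word, year) combination.
pivoted = (topw_ys
           .groupBy("word")
           .pivot("year")
           .agg(F.first("frequency")))

# Rename the pivoted year columns to frequency_<year>.
for c in pivoted.columns:
    if c != "word":
        pivoted = pivoted.withColumnRenamed(c, f"frequency_{c}")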

Related

Selecting Rows Based On Specific Condition In Python Pandas Dataframe

So I am new to using Python Pandas dataframes.
I have a dataframe with one column representing customer ids and other columns holding flavors and satisfaction scores; it looks something like this.
Although each customer should have 6 rows dedicated to them, Customer 1 only has 5. How do I create a new dataframe that will only print out customers who have 6 rows?
I tried doing: df['Customer No'].value_counts() == 6 but it is not working.
Here is one way to do it.
If you post the data as code (preferably) or as text, I would be able to share the result.
# create a temporary column 'c' by grouping on Customer No
# and assigning the group size to it using transform('count')
# finally, use loc to select the rows whose count equals 6
(df.loc[df.assign(
        c=df.groupby(['Customer No'])['Customer No']
             .transform('count')
    )['c'].eq(6)]
)
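
For comparison, a shorter equivalent sketch (again assuming the id column is literally named 'Customer No'); both variants below are standard pandas and not taken from the original answer:

# Boolean-mask version: keep rows whose customer id occurs exactly 6 times.
mask = df.groupby('Customer No')['Customer No'].transform('count').eq(6)
complete_customers = df[mask]

# Equivalent groupby().filter() version, often more readable but slower on large frames.
complete_customers = df.groupby('Customer No').filter(lambda g: len(g) == 6)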

Extract columns with range in Google Colab

I want to extract some columns, maybe just 10, from a dataframe with 30 columns, but I'm not finding any code or functions to do it. I tried with iloc but got no good results at all. Help, please. Here is my data frame:
So I just want to get the columns 1 to 10:
df1_10 = df.columns['1'....'10']
If you want to fetch 10 columns from your dataset, use this piece of code:
df.iloc[:, 1:11]  # this will give you 10 columns (the stop position is exclusive)
df.iloc[:, 1:10]  # this will give you only 9 columns.
# This is what you use in your code, which is why you don't get the desired result.
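
If you would rather select the columns by label than by position, .loc slicing also works; the column names below are purely illustrative placeholders, not names from the question:

# Label-based slice: with .loc both endpoints are inclusive.
# 'col_1' and 'col_10' are hypothetical names standing in for the real ones.
subset = df.loc[:, 'col_1':'col_10']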

Validate two data columns from different source dataframes in Databricks; if the data matches (record counts) row-wise, then execute the command, else raise an error

Dataframe 1:
created year  rec_counts
2016          50
2015          40

Dataframe 2:
created year  rec_counts
2016          1000
2015          47
There are 2 methods you can try.
Let's assume the names of the two DataFrames are df1 and df2.
First, if you just want to compare row counts, call df1.count() and df2.count() and check whether both return the same number (the total number of rows in each DataFrame).
Second, you can call df2.exceptAll(df1) (in PySpark; the Scala API calls this except/exceptAll), which returns the rows of df2 that are not present in df1. If the result is empty and the row counts match, the two DataFrames contain the same rows.
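
A minimal PySpark sketch of that check, assuming the two DataFrames are named df1 and df2 as above, and with run_downstream_command() as a hypothetical stand-in for "the command" mentioned in the question:

# Compare row counts first (cheap), then check the actual row contents.
same_count = df1.count() == df2.count()
no_extra_rows = df2.exceptAll(df1).count() == 0

if same_count and no_extra_rows:
    run_downstream_command()   # hypothetical placeholder for the real command
else:
    raise ValueError("Record counts/rows do not match between the two source DataFrames")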

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at hourly level with several columns. I want to extract the entire rows (containing all columns) for the top 10 values of a specific column for every year in my dataframe.
So far I ran the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem is that I only get the top 10 values of that specific column for each year, and I lose the other columns. How can I do this operation while keeping the values of the other columns that correspond to the top 10 'totaldemand' values per year?
We usually call head after sort_values (sorting first, then taking the first 10 rows per year, which keeps every column):
df_sorted = df.sort_values('totaldemand', ascending=False)
top10 = df_sorted.groupby(df_sorted.index.year).head(10)
nlargest can be applied to each group, passing the column in which to look for the largest values.
So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
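A quick self-contained check of that pattern with synthetic hourly data (the dates, values, and the 'othercol' name are purely illustrative):

import numpy as np
import pandas as pd

# Two years of synthetic hourly data, only for demonstration purposes.
idx = pd.date_range('2020-01-01', '2021-12-31 23:00', freq='h')
df = pd.DataFrame({'totaldemand': np.random.rand(len(idx)),
                   'othercol': np.random.rand(len(idx))}, index=idx)

# Top 3 hours per year, keeping every column of the matching rows.
top_per_year = df.groupby(df.index.year).apply(lambda grp: grp.nlargest(3, 'totaldemand'))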
Get the index of your query and use it as a mask on your original df:
idx = (df.groupby(df.index.year)['totaldemand']
         .apply(lambda grp: grp.nlargest(10))
         .index.get_level_values(-1))
df.loc[idx]
(or something to that extent, I can't test now without any test data)

How to group by and sum several columns?

I have a big dataframe with several columns which contain strings, numbers, etc. I am trying to group by SCENARIO and then sum only the columns between 2020 and 2050. The only thing I have managed so far is to sum one column, as shown below, but I need to replace this '2050' with all the columns between 2020 and 2050, for instance.
df1 = df.groupby(["SCENARIO"])['2050'].sum().sum(axis=0)
You are creating a subset of the df with only that single column. I can't tell what your dataset looks like from the information provided, but try:
df.groupby(["SCENARIO"]).sum()
This will sum up every numeric column within each group.
Alternatively, select only the columns on which you want to perform the summation:
df.groupby(["SCENARIO"])[["column1","column2"]].sum()