My data is like this:
year_month  user_id  pageviews  visits
2020-03     2        8          3
2021-03     27       4          3
2021-05     23       75         7
2020-05     23       17         7
2020-08     339      253        169
2020-08     892      31         4
2021-08     339      4          3
I want to group by year_month and calculate the difference in pageviews and visits from one year (2020) to the next (2021).
So I was thinking the output should be something like this (without the content inside the parentheses):
last_month  diff(pageviews)  diff(visits)
2021-03     -4 (4-8)         0 (3-3)
2021-05     58 (75-17)       0 (7-7)
2021-08     -280 (4-284)     -170 (3-173)
But I'm not sure how to do it in a vectorized way. I was thinking of passing it to pandas and doing it with a for loop, but I wanted to learn how to do this kind of thing vectorized with PySpark or Spark SQL, which I think would be much faster.
The main idea is to use a window function to compare months. Check the comments in the code for more explanation.
from pyspark.sql import functions as F
from pyspark.sql import Window as W
(df
    # since you'd want to compare month and year separately,
    # we have to split them out of year_month using the split function
    .withColumn('year', F.split('year_month', '-')[0].cast('int'))
    .withColumn('month', F.split('year_month', '-')[1].cast('int'))

    # you have multiple rows per year_month,
    # so we have to group and sum the similar records
    .groupBy('year', 'month')
    .agg(
        F.sum('pageviews').alias('pageviews'),
        F.sum('visits').alias('visits')
    )

    # now, to compare 2021's months with 2020's months,
    # use the lag window function; pay attention to the orderBy of the window
    .withColumn('prev_pageviews', F.lag('pageviews').over(W.orderBy('month', 'year')))
    .withColumn('prev_visits', F.lag('visits').over(W.orderBy('month', 'year')))

    # with the current and previous pageviews/visits on the same row,
    # the difference between years is a simple subtraction
    .withColumn('diff_pageviews', F.col('pageviews') - F.col('prev_pageviews'))
    .withColumn('diff_visits', F.col('visits') - F.col('prev_visits'))

    # select only the necessary columns and rows
    .select('year', 'month', 'diff_pageviews', 'diff_visits')
    .where(F.col('year') == 2021)
    .show()
)
# Output
# +----+-----+--------------+-----------+
# |year|month|diff_pageviews|diff_visits|
# +----+-----+--------------+-----------+
# |2021| 3| -4| 0|
# |2021| 5| 58| 0|
# |2021| 8| -280| -170|
# +----+-----+--------------+-----------+
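As a side note (my own variation, not part of the answer above): if you partition the lag window by month, each 2021 row is compared directly with the same month of 2020, so you don't have to interleave rows with orderBy('month', 'year'). A minimal sketch, assuming monthly_df is a hypothetical name for the aggregated (year, month, pageviews, visits) frame produced by the groupBy/agg step above:

from pyspark.sql import functions as F
from pyspark.sql import Window as W

# compare the same calendar month across consecutive years
w = W.partitionBy('month').orderBy('year')

(monthly_df  # hypothetical: the aggregated frame from the groupBy/agg step
    .withColumn('diff_pageviews', F.col('pageviews') - F.lag('pageviews').over(w))
    .withColumn('diff_visits', F.col('visits') - F.lag('visits').over(w))
    .where(F.col('year') == 2021)
    .select('year', 'month', 'diff_pageviews', 'diff_visits')
    .show()
)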
Related
I have a huge dataset spanning different years. As a subsample for local tests, I need to separate out a small dataframe that contains only a few samples per year. Does anyone have any idea how to do that?
After grouping by the 'year' column, the count of instances in each year is something like:
| year |    A |
| ---- | ---- |
| 1838 | 1000 |
| 1839 | 2600 |
| 1840 | 8900 |
| 1841 | 9900 |
I want to select a subset which after groupby looks like:
| year| A |
| ----| --|
| 1838| 10|
| 1839| 10|
| 1840| 10|
| 1841| 10|
Try groupby().sample().
Here's example usage with dummy data.
import numpy as np
import pandas as pd
# create a long array of 'years' from 1800 to 1804 (randint's upper bound is exclusive)
years = np.random.randint(low=1800, high=1805, size=200)
values = np.random.randint(low=1, high=200, size=200)
df = pd.DataFrame({'Years': years, 'Values': values})
number_per_year = 10
sample_df = df.groupby("Years").sample(n=number_per_year, random_state=1)
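A quick sanity check, plus one hedged extension (my assumption, not part of the original answer): sample raises an error if a group has fewer rows than the requested n, so you can cap the per-group sample size with apply.

# verify that each year contributed exactly `number_per_year` rows
print(sample_df.groupby("Years").size())

# cap the sample size for years with fewer rows than requested
capped_df = df.groupby("Years", group_keys=False).apply(
    lambda g: g.sample(n=min(number_per_year, len(g)), random_state=1)
)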
I have a dataframe that looks like below, the date is the index. How would I plot a time series showing a line for each of the years? I have tried df.plot(figsize=(15,4)) but this gives me one line.
Date Value
2008-01-31 22
2008-02-28 17
2008-03-31 34
2008-04-30 29
2009-01-31 33
2009-02-28 42
2009-03-31 45
2009-04-30 39
2019-01-31 17
2019-02-28 12
2019-03-31 11
2019-04-30 12
2020-01-31 24
2020-02-28 34
2020-03-31 43
2020-04-30 45
You can just do a groupby using year.
df = pd.read_clipboard()
df = df.set_index(pd.DatetimeIndex(df['Date']))
df.groupby(df.index.year)['Value'].plot()
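Under the hood that one-liner plots each year's Series onto the current axes. If you want an explicit legend labelled by year, an equivalent loop (just a sketch of the same idea) is:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(15, 4))
for year, grp in df.groupby(df.index.year):
    grp['Value'].plot(ax=ax, label=str(year))  # one line per year
ax.legend(title='Year')
plt.show()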
In case you want to use the years as separate series and compare them day by day:
import matplotlib.pyplot as plt
# Create a date column from index (easier to manipulate)
df["date_column"] = pd.to_datetime(df.index)
# Create a year column
df["year"] = df["date_column"].dt.year
# Create a month-day column
df["month_day"] = (df["date_column"].dt.month).astype(str).str.zfill(2) + \
"-" + df["date_column"].dt.day.astype(str).str.zfill(2)
# Plot. The pivot creates one column per year, and those columns are plotted as separate series.
df.pivot(index='month_day', columns='year', values='Value').plot(kind='line', figsize=(12, 8), marker='o')
plt.title("Values per Month-Day - Year comparison", y=1.1, fontsize=14)
plt.xlabel("Month-Day", labelpad=12, fontsize=12)
plt.ylabel("Value", labelpad=12, fontsize=12);
I have monthly population data from Jan to Dec for a particular year, plus one column, say "Constant", and I need to multiply that constant column's value with all the month columns from Jan to Dec in Spark. For example, I have the following data:
JAN FEB MAR...DEC Constant
City1 160 158 253 391 12
City2 212 27 362 512 34
City3 90 150 145 274 56
After multiplication, I want a new (or replaced) dataframe with these values:
JAN FEB MAR ....DEC
City1 192 1896 3036 1656
City2 7208 918 12308 8092
City3 504 280 8120 2464
I am able to do it one column at a time with this code:
Df.select("JAN","CONSTANT").withColumn("JAN", col('JAN') * col('CONSTANT')).show()
Is there any function/loop that lets me multiply all the month columns at once and get a new dataframe with the values for all months?
You could express your logic using a struct of columns. A struct is basically a higher-order column (a column of columns), so we can compute each month times the constant, keep each result's original name as a field of the struct, and then expand them back out with a columnname.* select. This way you don't have to call withColumn 12 times. You can put all your months in listofmonths.
df.show() #sampledata
#+-----+---+---+---+---+--------+
#| City|JAN|FEB|MAR|DEC|Constant|
#+-----+---+---+---+---+--------+
#|City1|160|158|253|391| 12|
#|City2|212| 27|362|512| 34|
#|City3| 90|150|145|274| 56|
#+-----+---+---+---+---+--------+
listofmonths=['JAN','FEB','MAR','DEC']
from pyspark.sql import functions as F
df.withColumn("arr", F.struct(*[(F.col(x)*F.col('Constant')).alias(x) for x in listofmonths]))\
.select("City","arr.*")\
.show()
#+-----+----+----+-----+-----+
#| City| JAN| FEB| MAR| DEC|
#+-----+----+----+-----+-----+
#|City1|1920|1896| 3036| 4692|
#|City2|7208| 918|12308|17408|
#|City3|5040|8400| 8120|15344|
#+-----+----+----+-----+-----+
You could also just use df.columns instead of listofmonths like this:
from pyspark.sql import functions as F
df.withColumn("arr", F.struct(*[(F.col(x)*F.col('Constant')).alias(x) for x in df.columns if x!='City' and x!='Constant']))\
.select("City","arr.*")\
.show()
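If you'd rather skip the intermediate struct altogether, the same result can be produced with a plain select and a list comprehension (an equivalent sketch, not the original answer's approach):

from pyspark.sql import functions as F

# build the multiplied month columns directly in a select
df.select(
    'City',
    *[(F.col(c) * F.col('Constant')).alias(c) for c in listofmonths]
).show()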
My 500 data frames look like this; each one contains daily data covering 2 years.
Date | Column A | Column B
2017-04-04
2017-04-05
2017-04-06
2017-04-07
....
2017-04-02
...
2019-02-01
2019-02-11
2019-02-22
2019-02-27
2019-03-01
2019-04-01
2019-05-01
All the data frames have a similar number of columns but a different number of rows, and they all share a few common timestamps. I want to extract the common timestamps from all my data frames.
The goal is to filter out common timestamps in all my 500 data frames and create a subset of new 500 data frames with just common timestamps.
If you can store all 500 in memory at once, then it's useful to store them in a dictionary. Then you can find the intersection of all dates, and then save the subsets:
import pandas as pd
from functools import reduce
d = dict((file, pd.read_csv(file)) for file in [your_list_of_files])
date_com = reduce(lambda l, r: l & r, [set(df.Date) for _, df in d.items()])
for file, df in d.items():
    df[df.Date.isin(date_com)].to_csv(f'adjusted_{file}')
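Equivalently (a small sketch of an alternative, under the same in-memory assumption), set.intersection can replace the reduce call:

# intersect the Date sets of all frames in one call
date_com = set.intersection(*(set(df.Date) for df in d.values()))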
Some illustrative data in a DataFrame (MultiIndex) format:
| entity | year | value |
+--------+------+-------+
| a      | 1999 |     2 |
|        | 2004 |     5 |
| b      | 2003 |     3 |
|        | 2007 |     2 |
|        | 2014 |     7 |
I would like to calculate the slope using scipy.stats.linregress for each entity a and b in the above example. I tried using groupby on the first column, following the split-apply-combine advice, but it seems problematic since it's expecting one Series of values (a and b), whereas I need to operate on the two columns on the right.
This is easily done in R via plyr, but I'm not sure how to approach it in pandas.
A function can be applied to a groupby with the apply method. The passed function in this case is linregress. Please see below:
In [4]: x = pd.DataFrame({'entity':['a','a','b','b','b'],
'year':[1999,2004,2003,2007,2014],
'value':[2,5,3,2,7]})
In [5]: x
Out[5]:
entity value year
0 a 2 1999
1 a 5 2004
2 b 3 2003
3 b 2 2007
4 b 7 2014
In [6]: from scipy.stats import linregress
In [7]: x.groupby('entity').apply(lambda v: linregress(v.year, v.value)[0])
Out[7]:
entity
a 0.600000
b 0.403226
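If you also want the intercept (or any of the other statistics linregress returns), the same apply can return a Series per group, which gives you a DataFrame instead of a single column. A hedged sketch along the same lines, reusing the x built in In [4] above:

import pandas as pd
from scipy.stats import linregress

def fit(v):
    res = linregress(v.year, v.value)
    # res[0] is the slope, res[1] the intercept
    return pd.Series({'slope': res[0], 'intercept': res[1]})

x.groupby('entity').apply(fit)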
You can do this via the iterator ability of the groupby object. It seems easier to do it by dropping the current index and then grouping by 'entity'.
A list comprehension is then an easy way to quickly work through all the groups in the iterator, or use a dict comprehension to keep the labels alongside the slopes (you can then stick the dict into a pd.DataFrame easily, as shown after the code below).
import pandas as pd
import scipy.stats
# This is your data
test = pd.DataFrame({'entity': ['a', 'a', 'b', 'b', 'b'],
                     'year': [1999, 2004, 2003, 2007, 2014],
                     'value': [2, 5, 3, 2, 7]}).set_index(['entity', 'year'])

# This creates the groups
groupby = test.reset_index().groupby(['entity'])

# Process groups with a list comprehension
slopes = [scipy.stats.linregress(group.year, group.value)[0] for name, group in groupby]

# Process groups with a dict comprehension
slopes = {name: [scipy.stats.linregress(group.year, group.value)[0]] for name, group in groupby}
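As mentioned above, the dict of one-element lists drops straight into a DataFrame; a minimal sketch of that last step, continuing from the dict version of slopes:

# turn the {entity: [slope]} dict into a one-row-per-entity DataFrame
slopes_df = pd.DataFrame(slopes, index=['slope']).T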