Pandas plot data column on x axis - pandas

I am trying to plot data using pandas. The data is as follows:
Name 1999 2000 2001
stud1 11 22 33
stud2 33 44 55
stud3 55 66 77
......
I need to plot each student's marks year-wise (with the year on the x-axis).

You can do it this way:
stud = pd.read_csv(r"C:/users/k_sego/students.csv", sep=";")
df = stud.pivot_table(columns=['Name'])
df.plot(kind='bar', legend=True)

You could try this:
df.pivot_table(columns=['Name']).plot()
It'll pivot your dataframe so that the year is the index and each student is a column.
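A self-contained sketch of that pivot, using the marks from the question (the column names are assumed to match the CSV):

```python
import pandas as pd

# Hypothetical data matching the sample in the question
stud = pd.DataFrame({
    "Name": ["stud1", "stud2", "stud3"],
    "1999": [11, 33, 55],
    "2000": [22, 44, 66],
    "2001": [33, 55, 77],
})

# Pivot so each student becomes a column and the years become the index
df = stud.pivot_table(columns=["Name"])
print(df)
# df.plot() would now draw one line per student, with years on the x-axis
```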

Related

How can I plot a line graph for each country that appears in the column?

I want to plot three lines for Turkey, the UK and the OECD through the years, but those countries are not columns, so I am struggling to find a way to plot them.
I get this df via
df = df.loc[df["Variable"].eq("Relative advantage")
& df["Country"].isin(["United Kingdom", "Türkiye", "OECD - Total"])]
Year    Country    Value
1990    Turkiye    20
1980    UK         34
1992    UK         32
1980    OECD       29
1992    OECD       23
You can use the pivot_table() method to do this. An example:
import pandas as pd

# Set up example dataframe
df = pd.DataFrame([
    [1990, 'Turkiye', 20],
    [1992, 'Turkiye', 22],
    [1990, 'UK', 34],
    [1992, 'UK', 32],
    [1990, 'OECD', 29],
    [1992, 'OECD', 23],
], columns=["year", "country", "value"])

# Pivot so countries are now columns
table = df.pivot_table(values='value', columns='country', index='year')
This creates a dataframe where the countries are columns:
country OECD Turkiye UK
year
1990 29 20 34
1992 23 22 32
(I changed some of the dates to make it work out a bit more nicely.)
Then I plot it:
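(The plot itself seems to have been an image; a minimal sketch of that last step, rebuilding the same example frame so it runs on its own:)

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line when working interactively
import pandas as pd

df = pd.DataFrame([
    [1990, 'Turkiye', 20],
    [1992, 'Turkiye', 22],
    [1990, 'UK', 34],
    [1992, 'UK', 32],
    [1990, 'OECD', 29],
    [1992, 'OECD', 23],
], columns=["year", "country", "value"])

table = df.pivot_table(values='value', columns='country', index='year')

# One line per country, with year on the x-axis
ax = table.plot()
ax.set_ylabel("value")
```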

How to plot time series and group years together?

I have a dataframe that looks like the one below; the date is the index. How would I plot a time series showing a line for each of the years? I have tried df.plot(figsize=(15,4)), but this gives me one line.
Date Value
2008-01-31 22
2008-02-28 17
2008-03-31 34
2008-04-30 29
2009-01-31 33
2009-02-28 42
2009-03-31 45
2009-04-30 39
2019-01-31 17
2019-02-28 12
2019-03-31 11
2019-04-30 12
2020-01-31 24
2020-02-28 34
2020-03-31 43
2020-04-30 45
You can just do a groupby using year.
df = pd.read_clipboard()
df = df.set_index(pd.DatetimeIndex(df['Date']))
df.groupby(df.index.year)['Value'].plot()
In case you want to use year as series of data and compare day to day:
import matplotlib.pyplot as plt
# Create a date column from index (easier to manipulate)
df["date_column"] = pd.to_datetime(df.index)
# Create a year column
df["year"] = df["date_column"].dt.year
# Create a month-day column
df["month_day"] = (df["date_column"].dt.month).astype(str).str.zfill(2) + \
"-" + df["date_column"].dt.day.astype(str).str.zfill(2)
# Plot. The pivot creates one column per year; each column is plotted as its own series.
df.pivot(index='month_day', columns='year', values='Value').plot(kind='line', figsize=(12, 8), marker='o')
plt.title("Values per Month-Day - Year comparison", y=1.1, fontsize=14)
plt.xlabel("Month-Day", labelpad=12, fontsize=12)
plt.ylabel("Value", labelpad=12, fontsize=12);

Normalisation or scaling of a column in pyspark

I want to scale a particular column in pyspark. In this case I want to scale the results column. My dataframe looks like this:
id age results
1 28 98
2 27 12
3 28 99
4 28 5
5 27 54
This is what I have done so far:
df = spark.createDataFrame(
    [(1, 28, 98), (2, 27, 12), (3, 28, 99), (4, 28, 5), (5, 27, 54)],
    ("id", "age", "results"))
minmax_result = df.groupBy("id").agg(min("results").alias("min_results"),
                                     max("results").alias("max_results"))
final_df = minmax_result.join(df, ["id"]).select(
    ((col("results") - col("min_results")) / col("min_results")).alias("scaled_results"))
final_df.show()
final_df.show()
it gives me like -
id age results scaled_results
1 28 98 null
2 27 12 null
3 28 99 null
4 28 5 null
5 27 54 null
I'm assuming you're planning to scale the column across all ids, so you won't be needing the groupby operation, unless you're going the UDF route. I'd suggest going with the following:
from pyspark.sql.functions import col

res_min = df.agg({"results": "min"}).collect()[0][0]
res_max = df.agg({"results": "max"}).collect()[0][0]
df_scaled = df.withColumn('scaled_results', (col('results') - res_min) / res_max)
I presume you're dividing each cell by the min value instead of the max value by mistake, but that might be the use case as well.
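As an aside, standard min-max normalization divides by the range (max - min) rather than by the max alone; a plain-Python sketch on the question's results values (98, 12, 99, 5, 54):

```python
results = [98, 12, 99, 5, 54]

lo, hi = min(results), max(results)

# Min-max normalization: the smallest value maps to 0, the largest to 1
scaled = [(x - lo) / (hi - lo) for x in results]

print([round(s, 3) for s in scaled])
```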
You can use the StandardScaler function from PySpark MLlib, something like this:
from pyspark.ml.feature import StandardScaler

# "features" must be a vector column (e.g. assembled with VectorAssembler)
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)
scalerModel = scaler.fit(new_df)
scaledData = scalerModel.transform(new_df)
Refer : https://spark.apache.org/docs/latest/mllib-feature-extraction.html
Upvote if it works.

how to calculate percentage changes across 2 columns in a dataframe using pct_change in Python

I have a dataframe and want to use the pct_change method to calculate the % change between only two of the columns, B and C, and put the output into a new column. The code below doesn't seem to work. Can anyone help me?
df2 = pd.DataFrame(np.random.randint(0,50,size=(100, 4)), columns=list('ABCD'))
df2['new'] = df2.pct_change(axis=1)['B']['C']
Try:
df2['new'] = df2[['B','C']].pct_change(axis=1)['C']
pct_change computes the change across all the columns; you can select the required column and assign it to a new column.
df2['new'] = df2.pct_change(axis=1)['C']
A B C D new
0 29 4 29 5 6.250000
1 14 35 2 40 -0.942857
2 5 18 31 10 0.722222
3 17 10 42 41 3.200000
4 24 48 47 35 -0.020833
IIUC, you can just do the following:
df2['new'] = (df2['C']-df2['B'])/df2['B']
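Both answers agree on the result; a quick check on a small fixed frame (values taken from the sample output above):

```python
import pandas as pd

df2 = pd.DataFrame({"A": [29, 14], "B": [4, 35], "C": [29, 2], "D": [5, 40]})

# pct_change across columns B and C...
via_pct = df2[["B", "C"]].pct_change(axis=1)["C"]

# ...matches the explicit formula
via_formula = (df2["C"] - df2["B"]) / df2["B"]

print(via_pct.tolist())
```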

Pandas pivot_table -- index values extended through rows

I'm trying to tidy some data, specifically by taking two columns "measure" and "value" and making more columns for each unique value of measure.
So far I have some python (3) code that reads in data and pivots it to the form that I want--roughly. This code looks like so:
import pandas as pd
#Load the data
df = pd.read_csv(r"C:\Users\User\Documents\example data.csv")
#Pivot the dataframe
df_pivot = df.pivot_table(index=['Geography Type', 'Geography Name', 'Week Ending',
'Item Name'], columns='Measure', values='Value')
print(df_pivot.head())
This outputs:
Measure X Y Z
Geography Type Geography Name Week Ending Item Name
Type 1 Total US 1/1/2018 Item A 57 51 16
Item B 95 37 17
1/8/2018 Item A 92 8 32
Item B 36 49 54
Type 2 Region 1 1/1/2018 Item A 78 46 88
This is almost perfect, but for my work I need to load this file into other software, and for that software to read the data correctly it needs values in every row. In other words, I need the index values to be repeated down through the rows, like so:
Measure X Y Z
Geography Type Geography Name Week Ending Item Name
Type 1 Total US 1/1/2018 Item A 57 51 16
Type 1 Total US 1/1/2018 Item B 95 37 17
Type 1 Total US 1/8/2018 Item A 92 8 32
Type 1 Total US 1/8/2018 Item B 36 49 54
Type 2 Region 1 1/1/2018 Item A 78 46 88
and so on.
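For what it's worth, the repeated values are already stored in the pivoted frame; pandas just displays a MultiIndex sparsely. A sketch with a made-up frame shaped like the sample: calling reset_index() turns the index levels into ordinary columns, repeated on every row, which is usually what file-based tools expect.

```python
import pandas as pd

# Hypothetical data shaped like the question's "Measure"/"Value" layout
df = pd.DataFrame({
    "Geography Type": ["Type 1"] * 6,
    "Geography Name": ["Total US"] * 6,
    "Week Ending": ["1/1/2018"] * 3 + ["1/8/2018"] * 3,
    "Item Name": ["Item A"] * 6,
    "Measure": ["X", "Y", "Z"] * 2,
    "Value": [57, 51, 16, 92, 8, 32],
})

df_pivot = df.pivot_table(index=["Geography Type", "Geography Name",
                                 "Week Ending", "Item Name"],
                          columns="Measure", values="Value")

# The index values are only *displayed* sparsely; reset_index() turns them
# into ordinary columns, repeated on every row
flat = df_pivot.reset_index()
print(flat)
```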