How to replace the dates of a month with the month name in Python? - pandas

Above is a plot of the dataframe below, with the date on the x-axis and High on the y-axis.
What I want is for the dates between 06-09-21 and 31-09-21 to be replaced with "Sep 21" in the graph, and likewise for the remaining dates with their respective months, because right now the x-axis is not readable.
I don't even know where to start.
Below is the code that I used to plot the graph:
import pandas as pd
from matplotlib import pyplot as plt

df = pd.read_csv("Stock.csv")
x1 = df['High'].values.tolist()
r = df['Date'].values.tolist()
plt.plot(r, x1, color="green", label="High")

You can use pandas.to_datetime with pandas.Series.dt.strftime.
Try this :
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True).dt.strftime("%b-%d")
Output (shown here as a new column, Date(new), to illustrate the change):
print(df)
Date Date(new)
0 08-09-21 Sep-08
1 09-09-21 Sep-09
2 13-09-21 Sep-13
3 30-08-21 Aug-30
4 01-09-22 Sep-01
5 02-09-22 Sep-02
6 05-09-22 Sep-05
7 06-09-22 Sep-06
Edit:
Since the goal is a readable x-axis rather than a reformatted column, you can keep the dates as datetimes and format the tick labels with matplotlib.dates.DateFormatter:
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)
f, ax = plt.subplots(figsize = [8, 4])
ax.plot(df["Date"], df["High"])
ax.xaxis.set_major_formatter(DateFormatter("%b-%Y"))
Input used:
Date Open High Low Close AdjClose Volume
0 06-09-21 1579.949951 1580.949951 1561.949951 1565.699951 1547.704712 3938448
1 07-09-21 1562.500000 1582.000000 1555.199951 1569.250000 1551.213989 3622748
2 08-09-21 1571.949951 1580.500000 1565.599976 1576.400024 1558.281860 3362040
3 09-09-21 1574.000000 1579.449951 1561.000000 1568.599976 1550.571411 4125474
4 13-09-21 1562.000000 1584.000000 1553.650024 1555.550049 1537.671509 4479582
5 30-08-22 1446.449951 1489.949951 1443.099976 1486.099976 1486.099976 5067700
6 01-09-22 1464.750000 1489.449951 1459.000000 1472.150024 1472.150024 11201568
7 02-09-22 1472.150024 1490.500000 1465.199951 1485.500000 1485.500000 6019043
8 05-09-22 1486.099976 1499.000000 1484.099976 1495.050049 1495.050049 6065966
9 06-09-22 1498.900024 1506.650024 1487.099976 1502.000000 1502.000000 4066957
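If you also want exactly one labelled tick per month (which is what the question is after), a MonthLocator can be combined with the formatter. A minimal sketch, assuming df["Date"] has already been converted with pd.to_datetime as above:
from matplotlib.dates import DateFormatter, MonthLocator

f, ax = plt.subplots(figsize=[8, 4])
ax.plot(df["Date"], df["High"])
# place one tick at the start of every month and label it e.g. "Sep-2021"
ax.xaxis.set_major_locator(MonthLocator())
ax.xaxis.set_major_formatter(DateFormatter("%b-%Y"))
plt.show()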

Related

Create a bar plot in plt when having the bins and heights

I have the following ranges of bins and their desired heights
Range | Height
-------------------------
0.0-0.0905 | 0.02601
0.0905-0.1811| 0.13678
0.1811-0.2716| 0.22647
0.2716-0.3621| 0.31481
0.3621-0.4527| 0.40681
0.4527-0.5432| 0.50200
0.5432-0.6337| 0.58746
0.6337-0.7243| 0.68153
0.7243-0.8148| 0.76208
0.8148-0.9053| 0.86030
0.9053-0.9958| 0.95027
0.9958-1 | 0.99584
The desired outcome is a histogram/bar plot with the edges according to Range and the heights according to Height.
You can split your Range and explode to get the edges of the bins:
import pandas as pd
from io import StringIO
data = StringIO("""Range | Height
-------------------------
0.0-0.0905 | 0.02601
0.0905-0.1811| 0.13678
0.1811-0.2716| 0.22647
0.2716-0.3621| 0.31481
0.3621-0.4527| 0.40681
0.4527-0.5432| 0.50200
0.5432-0.6337| 0.58746
0.6337-0.7243| 0.68153
0.7243-0.8148| 0.76208
0.8148-0.9053| 0.86030
0.9053-0.9958| 0.95027
0.9958-1 | 0.99584""")
df = pd.read_csv(data, sep=r"\s*\|\s*", engine="python", skiprows=[1])
df['Range'] = df['Range'].str.split('-')
df = df.explode('Range').drop_duplicates('Range').astype(float)
This will give you:
Range Height
0 0.0000 0.02601
0 0.0905 0.02601
1 0.1811 0.13678
2 0.2716 0.22647
3 0.3621 0.31481
4 0.4527 0.40681
5 0.5432 0.50200
6 0.6337 0.58746
7 0.7243 0.68153
8 0.8148 0.76208
9 0.9053 0.86030
10 0.9958 0.95027
11 1.0000 0.99584
Then use plt.stairs:
import matplotlib.pyplot as plt
plt.stairs(df['Height'].iloc[1:], edges=df['Range'].values, fill=True)
plt.show()
Output: a bar-style step plot with edges taken from Range and heights from Height.
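On older Matplotlib versions that lack plt.stairs, the same figure can be drawn with plt.bar, using the left edges and the bin widths. A sketch, assuming the exploded df from above:
import numpy as np
import matplotlib.pyplot as plt

edges = df['Range'].to_numpy()               # 13 bin edges
heights = df['Height'].iloc[1:].to_numpy()   # 12 heights, one per bin

# each bar starts at the left edge of its bin and spans the bin's width
plt.bar(edges[:-1], heights, width=np.diff(edges), align='edge', edgecolor='black')
plt.show()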

Pandas to mark both if cell value is a substring of another

I have a column with short and full forms of people's names, and I want to unify them when one name is a prefix of the other. E.g. for "James.J" and "James.Jones", I want to tag both as "James.J".
data = {'Name': ["Amelia.Smith",
"Lucas.M",
"James.J",
"Elijah.Brown",
"Amelia.S",
"James.Jones",
"Benjamin.Johnson"]}
df = pd.DataFrame(data)
I can't figure out how to do it in pandas, so for now I only have an xlrd way, using a similarity ratio from SequenceMatcher (and sorting manually in Excel):
import xlrd
from xlrd import open_workbook, cellname
import xlwt
from xlutils.copy import copy
from difflib import SequenceMatcher

workbook = xlrd.open_workbook("C:\\TEM\\input.xlsx")
old_sheet = workbook.sheet_by_name("Sheet1")
wb = copy(workbook)
sheet = wb.get_sheet(0)

for row_index in range(0, old_sheet.nrows):
    current = old_sheet.cell(row_index, 0).value
    previous = old_sheet.cell(row_index - 1, 0).value
    sro = SequenceMatcher(None, current.lower(), previous.lower(), autojunk=True).ratio()
    if sro > 0.7:
        sheet.write(row_index, 1, previous)
        sheet.write(row_index - 1, 1, previous)

wb.save("C:\\TEM\\output.xls")
What's the nice pandas way to do it? Thank you.
Using pandas, you can make use of str.split and .map, with some boolean conditions to identify the duplicates:
df1 = df['Name'].str.split('.', expand=True).rename(columns={0: 'FName', 1: 'LName'})

df2 = (df1.loc[df1['FName'].duplicated(keep=False)]
          .assign(ky=df['Name'].str.len())
          .sort_values('ky')
          .drop_duplicates(subset=['FName'], keep='first')
          .drop(columns='ky'))

df['NewName'] = df1['FName'].map(df2.assign(newName=df2.agg('.'.join, axis=1))
                                    .set_index('FName')['newName'])
print(df)
Name NewName
0 Amelia.Smith Amelia.S
1 Lucas.M NaN
2 James.J James.J
3 Elijah.Brown NaN
4 Amelia.S Amelia.S
5 James.Jones James.J
6 Benjamin.Johnson NaN
Here is an example of using apply with a custom function. For small dataframes this should be fine; it will not scale well for large ones. A more sophisticated data structure for memo would be a good place to start improving performance without degrading readability too much:
df = df.sort_values("Name")

def short_name(row, col="Name", memo=[]):
    # memo persists across calls and holds every name seen so far; because the
    # frame is sorted, a short form is always seen before any longer form of it
    name = row[col]
    for m_name in memo:
        if name.startswith(m_name):
            return m_name
    memo.append(name)
    return name

df["short_name"] = df.apply(short_name, axis=1)
df = df.sort_index()
output:
Name short_name
0 Amelia.Smith Amelia.S
1 Lucas.M Lucas.M
2 James.J James.J
3 Elijah.Brown Elijah.Brown
4 Amelia.S Amelia.S
5 James.Jones James.J
6 Benjamin.Johnson Benjamin.Johnson
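For this particular data, the same result can also be reached with a groupby on the part of the name before the dot, mapping every name in a group to the group's shortest member. This is only a sketch and assumes the short form always shares that prefix:
import pandas as pd

data = {'Name': ["Amelia.Smith", "Lucas.M", "James.J", "Elijah.Brown",
                 "Amelia.S", "James.Jones", "Benjamin.Johnson"]}
df = pd.DataFrame(data)

# group on the first-name part and take the shortest name per group
first = df['Name'].str.split('.').str[0]
df['short_name'] = df.groupby(first)['Name'].transform(lambda s: min(s, key=len))
print(df)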

How do you strip out only the integers of a column in pandas?

I am trying to strip out only the numeric values, which are the first 1 or 2 digits. Some values in the column contain pure strings and others contain special characters (the original question included a screenshot of the value counts).
I have tried multiple methods:
breaks['_Size'] = breaks['Size'].fillna(0)
breaks[breaks['_Size'].astype(str).str.isdigit()]
breaks['_Size'] = breaks['_Size'].replace('\*','',regex=True).astype(float)
breaks['_Size'] = breaks['_Size'].str.extract('(\d+)').astype(int)
breaks['_Size'].map(lambda x: x.rstrip('aAbBcC'))
None of them are working. The dtype is object. To be clear, I am attempting to make a new column with only the digits (as an int/float), and if I could convert the fractions to decimals that would be a bonus.
This works for dividing fractions and also allows for extra numbers to be present in the string (it returns just the first sequence of digits):
In [60]: import pandas as pd
In [61]: import re
In [62]: df = pd.DataFrame([0, "6''", '7"', '8in', 'text', '3/4"', '1a3'], columns=['_Size'])
In [63]: df
Out[63]:
_Size
0 0
1 6''
2 7"
3 8in
4 text
5 3/4"
6 1a3
In [64]: def cleaning_function(row):
...: row = str(row)
...: fractions = re.findall(r'(\d+)/(\d+)', row)
...: if fractions:
...: return float(int(fractions[0][0])/int(fractions[0][1]))
...: numbers = re.findall(r'[0-9]+', str(row))
...: if numbers:
...: return numbers[0]
...: return 0
...:
In [65]: df._Size.apply(cleaning_function)
Out[65]:
0 0
1 6
2 7
3 8
4 0
5 0.75
6 1
Name: _Size, dtype: object
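A vectorized sketch of the same idea with str.extract; the sample Series here just mirrors the one above (the real column would be breaks['Size'] from the question). It also yields a numeric float column directly:
import pandas as pd

s = pd.Series([0, "6''", '7"', '8in', 'text', '3/4"', '1a3'], name='_Size').astype(str)

# "a/b" fractions first, otherwise the first run of digits, otherwise 0
frac = s.str.extract(r'(\d+)/(\d+)').astype(float)
num = s.str.extract(r'(\d+)', expand=False).astype(float)
size = (frac[0] / frac[1]).fillna(num).fillna(0)
print(size)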

Pandas accumulate data for linear regression

I'm trying to adjust my data so that total_gross per day is accumulated, e.g.:
Created   total_gross   total_gross_accumulated
Day 1     100           100
Day 2     100           200
Day 3     100           300
Day 4     100           400
Any idea how I have to change my code to have total_gross_accumulated available?
Here is my data.
My code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model

def load_event_data():
    df = pd.read_csv('sample-data.csv', usecols=['created', 'total_gross'])
    df['created'] = pd.to_datetime(df.created)
    return df.set_index('created').resample('D').sum().fillna(0)

event_data = load_event_data()
X = event_data.index
y = event_data.total_gross
plt.xticks(rotation=90)
plt.plot(X, y)
plt.show()
A list comprehension is one pythonic way to do this.
SHORT answer:
This should give you the new column that you want:
n = event_data.shape[0]
# skip line 0 and start by accumulating from 1 until the end
total_gross_accumulated =[event_data['total_gross'][:i].sum() for i in range(1,n+1)]
# add the new variable in the initial pandas dataframe
event_data['total_gross_accumulated'] = total_gross_accumulated
Or, faster:
event_data['total_gross_accumulated'] = event_data['total_gross'].cumsum()
LONG answer:
Full code using your data:
import pandas as pd
import matplotlib.pyplot as plt

def load_event_data():
    df = pd.read_csv('sample-data.csv', usecols=['created', 'total_gross'])
    df['created'] = pd.to_datetime(df.created)
    return df.set_index('created').resample('D').sum().fillna(0)

event_data = load_event_data()

n = event_data.shape[0]
# skip line 0 and start by accumulating from 1 until the end
total_gross_accumulated = [event_data['total_gross'][:i].sum() for i in range(1, n + 1)]
# add the new variable to the initial pandas dataframe
event_data['total_gross_accumulated'] = total_gross_accumulated
Results:
event_data.head(6)
# total_gross total_gross_accumulated
#created
#2019-03-01 3481810 3481810
#2019-03-02 4690 3486500
#2019-03-03 0 3486500
#2019-03-04 0 3486500
#2019-03-05 0 3486500
#2019-03-06 0 3486500
X = event_data.index
y = event_data.total_gross_accumulated
plt.xticks(rotation=90)
plt.plot(X, y)
plt.show()
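Since the question mentions linear regression, here is a minimal sketch of fitting a trend line to the accumulated series (this is an assumption about what comes next, not part of the original answer; it uses event_data from above and converts the dates to ordinals so scikit-learn can treat them as a numeric feature):
from sklearn.linear_model import LinearRegression

# one numeric feature per day: the date as an ordinal
X = event_data.index.map(pd.Timestamp.toordinal).to_numpy().reshape(-1, 1)
y = event_data['total_gross_accumulated'].to_numpy()

model = LinearRegression().fit(X, y)
trend = model.predict(X)

plt.plot(event_data.index, y, label='accumulated')
plt.plot(event_data.index, trend, label='linear fit')
plt.legend()
plt.xticks(rotation=90)
plt.show()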

Defining a function to plot a graph from CSV data - Python pandas

I am trying to play around with data analysis, taking in data from a simple CSV file I have created with random values in it.
I have defined a function that should allow the user to type in a value and then, from the dataFrame, plot a bar graph. See below:
def analysis_currency_pair():
    x = raw_input("what currency pair would you like to analyse? : ")
    print type(x)
    global dataFrame
    df1 = dataFrame
    df2 = df1[['currencyPair', 'amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    df2 = df2.loc[x].plot(kind='bar')
When I call the function, the code prints my question and the type of the currency pair entered. However, it doesn't seem to pass x (the value input by the user) into the later half of the function, and so no graph is produced.
Am I doing something wrong here?
The code works when I just put the value in directly, rather than within a function.
I am confused!
I think you need to rewrite your function to take two parameters, x and df, which are passed to analysis_currency_pair:
import pandas as pd

df = pd.DataFrame({"currencyPair": pd.Series({1: 'EURUSD', 2: 'EURGBP', 3: 'CADUSD'}),
                   "amount": pd.Series({1: 2, 2: 2, 3: 3.5}),
                   "a": pd.Series({1: 7, 2: 8, 3: 9})})
print df
#   a  amount currencyPair
#1  7     2.0       EURUSD
#2  8     2.0       EURGBP
#3  9     3.5       CADUSD

def analysis_currency_pair(x, df1):
    print type(x)
    df2 = df1[['currencyPair', 'amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    df2 = df2.loc[x].plot(kind='bar')

# raw input is EURUSD or EURGBP or CADUSD
pair = raw_input("what currency pair would you like to analyse? : ")
analysis_currency_pair(pair, df)
Or you can pass a string to analysis_currency_pair directly:
import pandas as pd

df = pd.DataFrame({"currencyPair": ['EURUSD', 'EURGBP', 'CADUSD', 'EURUSD', 'EURGBP'],
                   "amount": [1, 2, 3, 4, 5],
                   "amount1": [5, 4, 3, 2, 1]})
print df
#   amount  amount1 currencyPair
#0       1        5       EURUSD
#1       2        4       EURGBP
#2       3        3       CADUSD
#3       4        2       EURUSD
#4       5        1       EURGBP

def analysis_currency_pair(x, df1):
    print type(x)
    # <type 'str'>
    df2 = df1[['currencyPair', 'amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    print df2
    #               amount
    # currencyPair
    # CADUSD            3
    # EURGBP            7
    # EURUSD            5
    df2 = df2.loc[x].plot(kind='bar')

analysis_currency_pair('CADUSD', df)
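For reference, on Python 3 the same approach would look roughly like this (input() instead of raw_input(), print() as a function; a sketch rather than part of the original answer):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"currencyPair": ['EURUSD', 'EURGBP', 'CADUSD', 'EURUSD', 'EURGBP'],
                   "amount": [1, 2, 3, 4, 5]})

def analysis_currency_pair(x, df1):
    # sum the amounts per pair and plot only the requested pair as a bar
    totals = df1.groupby('currencyPair')['amount'].sum()
    totals.loc[[x]].plot(kind='bar')
    plt.show()

pair = input("what currency pair would you like to analyse? : ")
analysis_currency_pair(pair, df)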