Pandas - get_loc nearest for whole column - pandas

I have a df with date and price.
Given a datetime, I would like to find the price at the nearest date.
This works for one input datetime:
import requests, xlrd, openpyxl, datetime
import pandas as pd
file = "E:/prices.csv" #two columns: Timestamp (UNIX epoch), Price (int)
df = pd.read_csv(file, index_col=None, names=["Timestamp", "Price"])
df['Timestamp'] = pd.to_datetime(df['Timestamp'],unit='s')
df = df.drop_duplicates(subset=['Timestamp'], keep='last')
df = df.set_index('Timestamp')
file = "E:/input.csv" #two columns: ID (string), Date (dd-mm-yyy hh:ss:mm)
dfinput = pd.read_csv(file, index_col=None, names=["ID", "Date"])
dfinput['Date'] = pd.to_datetime(dfinput['Date'], dayfirst=True)
exampledate = pd.to_datetime("20-3-2020 21:37", dayfirst=True)
exampleprice = df.iloc[df.index.get_loc(exampledate, method='nearest')]["Price"]
print(exampleprice) #price as output
I have another dataframe with the datetimes ("dfinput") I want to lookup prices of and save in a new column "Price".
Something like this which is obviously not working:
dfinput['Date'] = pd.to_datetime(dfinput['Date'], dayfirst=True)
dfinput['Price'] = df.iloc[df.index.get_loc(dfinput['Date'], method='nearest')]["Price"]
dfinput.to_csv('output.csv', index=False, columns=["ID", "Date", "Price"])
Can I do this for a whole column or do I need to iterate over all rows?

I think you need merge_asof (cannot test, because no sample data). Both keys have to be sorted first, and dfinput should be the left side so that each input row picks up the price at the nearest Timestamp:
df = df.sort_index()
dfinput = dfinput.sort_values('Date')
dfinput = pd.merge_asof(dfinput, df, left_on='Date', right_index=True, direction='nearest')
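A minimal, self-contained sketch of that idea with made-up data (the real CSVs from the question aren't available, so the values here are only illustrative):
import pandas as pd
# made-up price history indexed by Timestamp
df = pd.DataFrame({'Price': [100, 110, 120]},
                  index=pd.to_datetime(['2020-03-20 21:00', '2020-03-20 22:00', '2020-03-20 23:00']))
df.index.name = 'Timestamp'
# made-up lookup rows
dfinput = pd.DataFrame({'ID': ['a', 'b'],
                        'Date': pd.to_datetime(['2020-03-20 21:37', '2020-03-20 22:55'])})
df = df.sort_index()
dfinput = dfinput.sort_values('Date')
dfinput = pd.merge_asof(dfinput, df, left_on='Date', right_index=True, direction='nearest')
print(dfinput)  # each input row now carries the Price at the nearest Timestamp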

Related

Simpler Pandas filter, select datetime columns to dict

I have an employee schedule that I filter to get a DF of name, timein, timeout similar to this:
import pandas as pd
from datetime import datetime
employees = [('BOB', datetime(2022,12,1,6,0,0), datetime(2022,12,1,14,0,0)),
('BOB', datetime(2022,12,2,6,0,0), datetime(2022,12,2,14,0,0)),
('GILL', datetime(2022,12,1,6,0,0), datetime(2022,12,1,14,0,0)),
('GILL', datetime(2022,12,3,6,0,0), datetime(2022,12,3,14,0,0)),
('TOBY', datetime(2022,12,1,14,0,0), datetime(2022,12,1,20,30,0))]
labels = ['name', 'timein', 'timeout']
df = pd.DataFrame.from_records(employees, columns=labels)
I need to compare the timedelta between the current timeout and the next timein value. My thought is to filter, select and update to a dict:
{'BOB' : [(datetime(2022,12,1,6,0,0), datetime(2022,12,1,14,0,0)), (datetime(2022,12,2,6,0,0), datetime(2022,12,2,14,0,0)), etc...}
Then it should be a simple test (for a common error pattern): dict['BOB'][i+1][0] - dict['BOB'][i][1] < fixed_duration
But Pandas goes through some Numpy wringer and produces gosh knows what:
results = {}
names = df['name'].unique().tolist()
for name in names:
    times = df.loc[df['name'] == 'BOB', ['schedulein', 'scheduleout']].values.tolist()
    results.update({name: times})
results
{'BOB': [[1669874400000000000, 1669903200000000000],
[1669960800000000000, 1669989600000000000]],
'GILL': [[1669874400000000000, 1669903200000000000],
[1669960800000000000, 1669989600000000000]],
'TOBY': [[1669874400000000000, 1669903200000000000],
[1669960800000000000, 1669989600000000000]]}
Why can't I get Datetime out?
Bonus if you know a more Pandas way to, I call it, "filter, select".
Here is what you want to do:
import pandas as pd
from datetime import datetime
employees = [('BOB', datetime(2022,12,1,6,0,0), datetime(2022,12,1,14,0,0)),
             ('BOB', datetime(2022,12,2,6,0,0), datetime(2022,12,2,14,0,0)),
             ('GILL', datetime(2022,12,1,6,0,0), datetime(2022,12,1,14,0,0)),
             ('GILL', datetime(2022,12,3,6,0,0), datetime(2022,12,3,14,0,0)),
             ('TOBY', datetime(2022,12,1,14,0,0), datetime(2022,12,1,20,30,0))]
labels = ['name', 'timein', 'timeout']
df = pd.DataFrame.from_records(employees, columns=labels)
results = {}
names = df['name'].unique().tolist()
for name in names:
    times = df.loc[df['name'] == name, ['timein', 'timeout']].astype(object).values.tolist()
    results.update({name: times})
print(results)
which gives you:
{'BOB': [[Timestamp('2022-12-01 06:00:00'), Timestamp('2022-12-01 14:00:00')], [Timestamp('2022-12-02 06:00:00'), Timestamp('2022-12-02 14:00:00')]], 'GILL': [[Timestamp('2022-12-01 06:00:00'), Timestamp('2022-12-01 14:00:00')], [Timestamp('2022-12-03 06:00:00'), Timestamp('2022-12-03 14:00:00')]], 'TOBY': [[Timestamp('2022-12-01 14:00:00'), Timestamp('2022-12-01 20:30:00')]]}
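For the bonus question, one arguably more idiomatic sketch (reusing the df built above) groups once instead of filtering per name; iterating a datetime64 column yields Timestamp objects, so no astype(object) is needed:
results = {name: list(zip(g['timein'], g['timeout'])) for name, g in df.groupby('name')}
print(results)  # {'BOB': [(Timestamp('2022-12-01 06:00:00'), Timestamp('2022-12-01 14:00:00')), ...], ...}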

How to sort object data type index into datetime in pandas?

Index(['Apr-20', 'Apr-21', 'Apr-22', 'Aug-20', 'Aug-21', 'Aug-22', 'Dec-20',
'Dec-21', 'Dec-22', 'Feb-21', 'Feb-22', 'Jan-21', 'Jan-22', 'Jan-23',
'Jul-20', 'Jul-21', 'Jul-22', 'Jun-20', 'Jun-21', 'Jun-22', 'Mar-20',
'Mar-21', 'Mar-22', 'May-20', 'May-21', 'May-22', 'Nov-20', 'Nov-21',
'Nov-22', 'Oct-20', 'Oct-21', 'Oct-22', 'Sep-20', 'Sep-21', 'Sep-22'],
dtype='object', name='months')
How can I sort this month-year object dtype as datetimes (format 'MMM-YY') in pandas? Thanks in advance!
If you only need to sort the index values as if they were datetimes, use DataFrame.sort_index with the key parameter:
df = df.sort_index(key=lambda x: pd.to_datetime(x, format='%b-%y'))
If you need a DatetimeIndex and then sorting, use:
df.index = pd.to_datetime(df.index, format='%b-%y')
df = df.sort_index()
Another idea is to create a PeriodIndex:
df.index = pd.to_datetime(df.index, format='%b-%y').to_period('m')
df = df.sort_index()
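A small self-contained sketch of the first approach, with a made-up 'sales' column (the question doesn't show the actual data):
import pandas as pd
df = pd.DataFrame({'sales': [1, 2, 3]},
                  index=pd.Index(['Apr-21', 'Jan-21', 'Dec-20'], name='months'))
# sort the string labels as if they were datetimes, keeping the original labels
df = df.sort_index(key=lambda x: pd.to_datetime(x, format='%b-%y'))
print(df.index.tolist())  # ['Dec-20', 'Jan-21', 'Apr-21']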

df.groupby('columns').apply(''.join()), join all the cells to a string

I want to group by 'key' and join all the cells in each group into one string. This is a task for a junior data processor; I've tried many ways in the past.
import pandas as pd
data = {'key': ['a','b','c','a','b','c','a'],
        'profit': [12,3,4,5,6,7,9],
        'income': ['j','d','d','g','d','t','d']}
df = pd.DataFrame(data)
df = df.set_index('key')
#df2 is the expected result: one concatenated string per key (the column name here is arbitrary)
data2 = {'key': ['a','b','c'], 'value': ['12j5g9d','3d6d','4d7t']}
df2 = pd.DataFrame(data2)
df2 = df2.set_index('key')
Here's a simple solution: first convert the integers to strings and concatenate profit and income, then concatenate all strings under the same key:
data = {'key': ['a','b','c','a','b','c','a'],
        'profit': [12,3,4,5,6,7,9],
        'income': ['j','d','d','g','d','t','d']}
df = pd.DataFrame(data)
df['profit_income'] = df['profit'].apply(str) + df['income']
res = df.groupby('key')['profit_income'].agg(''.join)
print(res)
output:
key
a 12j5g9d
b 3d6d
c 4d7t
Name: profit_income, dtype: object
This question can be solved in a couple of different ways:
First add an extra column by concatenating the profit and income columns.
import pandas as pd
data = {'key': ['a','b','c','a','b','c','a'],
        'profit': [12,3,4,5,6,7,9],
        'income': ['j','d','d','g','d','t','d']}
df = pd.DataFrame(data)
df = df.set_index('key')
df['profinc']=df['profit'].astype(str)+df['income']
1) Using sum
df2=df.groupby('key').profinc.sum()
2) Using apply and join
df2=df.groupby('key').profinc.apply(''.join)
Results from both of the above would be the same:
key
a 12j5g9d
b 3d6d
c 4d7t

Find column whose name contains a specific string

I have a dataframe with column names, and I want to find the one that contains a certain string, but does not exactly match it. I'm searching for 'spike' in column names like 'spike-2', 'hey spike', 'spiked-in' (the 'spike' part is always continuous).
I want the column name to be returned as a string or a variable, so I access the column later with df['name'] or df[name] as normal. I've tried to find ways to do this, to no avail. Any tips?
Just iterate over DataFrame.columns; here is an example in which you end up with a list of the column names that match:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
spike_cols = [col for col in df.columns if 'spike' in col]
print(list(df.columns))
print(spike_cols)
Output:
['hey spke', 'no', 'spike-2', 'spiked-in']
['spike-2', 'spiked-in']
Explanation:
df.columns returns an Index of column names
[col for col in df.columns if 'spike' in col] iterates over df.columns with the variable col and adds it to the resulting list if col contains 'spike'. This syntax is a list comprehension.
If you only want the resulting data set with the columns that match you can do this:
df2 = df.filter(regex='spike')
print(df2)
Output:
   spike-2  spiked-in
0        1          7
1        2          8
2        3          9
This answer uses the DataFrame.filter method to do this without list comprehension:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6]}
df = pd.DataFrame(data)
print(df.filter(like='spike').columns)
Will output just 'spike-2'. You can also use regex, as some people suggested in comments above:
print(df.filter(regex='spike|spke').columns)
Will output both columns: ['spike-2', 'hey spke']
You can also use df.columns[df.columns.str.contains(pat = 'spike')]
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
colNames = df.columns[df.columns.str.contains(pat = 'spike')]
print(colNames)
This will output the column names: 'spike-2', 'spiked-in'
More about pandas.Series.str.contains.
# select columns containing 'spike'
df.filter(like='spike', axis=1)
You can also select by name or by regular expression. Refer to pandas.DataFrame.filter.
df.loc[:,df.columns.str.contains("spike")]
Another solution that returns a subset of the df with the desired columns:
df[df.columns[df.columns.str.contains("spike|spke")]]
You can also use this code:
spike_cols =[x for x in df.columns[df.columns.str.contains('spike')]]
Getting name and subsetting based on Start, Contains, and Ends:
# from: https://stackoverflow.com/questions/21285380/find-column-whose-name-contains-a-specific-string
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
# from: https://cmdlinetips.com/2019/04/how-to-select-columns-using-prefix-suffix-of-column-names-in-pandas/
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html
import pandas as pd
data = {'spike_starts': [1,2,3], 'ends_spike_starts': [4,5,6], 'ends_spike': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)
print("\n")
print("----------------------------------------")
colNames_contains = df.columns[df.columns.str.contains(pat = 'spike')].tolist()
print("Contains")
print(colNames_contains)
print("\n")
print("----------------------------------------")
colNames_starts = df.columns[df.columns.str.contains(pat = '^spike')].tolist()
print("Starts")
print(colNames_starts)
print("\n")
print("----------------------------------------")
colNames_ends = df.columns[df.columns.str.contains(pat = 'spike$')].tolist()
print("Ends")
print(colNames_ends)
print("\n")
print("----------------------------------------")
df_subset_start = df.filter(regex='^spike',axis=1)
print("Starts")
print(df_subset_start)
print("\n")
print("----------------------------------------")
df_subset_contains = df.filter(regex='spike',axis=1)
print("Contains")
print(df_subset_contains)
print("\n")
print("----------------------------------------")
df_subset_ends = df.filter(regex='spike$',axis=1)
print("Ends")
print(df_subset_ends)

Pandas DataFrame expand existing dataset to finer timestamp

I am trying to make this piece of code faster; it is struggling to convert ~120K rows into ~1.7M.
Essentially, I am trying to expand each date-stamped entry into 14 rows, one for each of the 14 days from PayPeriodEndingDate back to T-14.
Does anyone have a better suggestion than itertuples for this loop?
Thanks!!
df_Final = pd.DataFrame()
for row in merge4.itertuples():
    listX = []
    listX.append(row)
    df = pd.DataFrame(listX*14)
    df = df.reset_index().drop('Index',axis=1)
    df['Hours'] = df['Hours']/14
    df['AmountPaid'] = df['AmountPaid']/14
    df['PayPeriodEnding'] = np.arange(df.loc[:,'PayPeriodEnding'][0] - np.timedelta64(14,'D'), df.loc[:,'PayPeriodEnding'][0], dtype='datetime64[D]')
    frames = [df_Final,df]
    df_Final = pd.concat(frames,axis=0)
df_Final
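For what it's worth, here is a vectorized sketch of the same expansion (untested against the real data; it assumes merge4 has a unique default index and the column names used in the loop above):
import numpy as np
import pandas as pd
# repeat every row of merge4 fourteen times
df_Final = merge4.loc[merge4.index.repeat(14)].copy()
df_Final['Hours'] = df_Final['Hours'] / 14
df_Final['AmountPaid'] = df_Final['AmountPaid'] / 14
# per original row, shift PayPeriodEnding back by 14, 13, ..., 1 days,
# mirroring np.arange(end - 14 days, end) from the loop above
offsets = np.tile(np.arange(14, 0, -1), len(merge4))
df_Final['PayPeriodEnding'] = df_Final['PayPeriodEnding'] - pd.to_timedelta(offsets, unit='D')
df_Final = df_Final.reset_index(drop=True)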