I have a csv with the first column the date and the 5th the hours.
I would like to merge them in a single column with a specific format in order to write another csv file.
This is basically the file:
DATE,DAY.WEEK,DUMMY.WEEKENDS.HOLIDAYS,DUMMY.MONDAY,HOUR
01/01/2015,5,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,2,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,3,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,4,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,5,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,6,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,7,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,8,1,0,0,0,0,0,0,0,0,0,0,0
I have tried to read the dataframe as
dataR = pd.read_csv(fnamecsv)
and convert the first line to date, as:
date_dt3 = datetime.strptime(dataR["DATE"].iloc[0], '%d/%m/%Y')
However, this seems to me not the correct way for two reasons:
1) it add the hour without considering the hour column;
2) it seems not use the pandas feature.
Thanks for any kind of help,
Diedro
Using + operator
you need to convert data frame elements into string before join. you can also use different separators during join, e.g. dash, underscore or space.
import pandas as pd
df = pd.DataFrame({'Last': ['something', 'you', 'want'],
'First': ['merge', 'with', 'this']})
print('Before Join')
print(df, '\n')
print('After join')
df['Name']= df["First"].astype(str) +" "+ df["Last"]
print(df) ```
You can use read_csv with parameters parse_dates with list of both columns names and date_parser for specify format:
f = lambda x: pd.to_datetime(x, format='%d/%m/%Y %H')
dataR = pd.read_csv(fnamecsv, parse_dates=[['DATE','HOUR']], date_parser=f)
Or convert hours to timedeltas and add to datetimes later:
dataR = pd.read_csv(fnamecsv, parse_dates=[0], dayfirst=True)
dataR['DATE'] += pd.to_timedelta(dataR.pop('HOUR'), unit='H')
Related
I want to clean up this date column inside of a csv file using python pandas.
Let's say my code is:
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie'],
'date_of_birth': ['10/25/2005 R','10/29/2002','01/01/2001 BD']
})
How can I clean up this mess for thousands of rows?
I thought of using:
df['date of birth'] = df['new date'].str[0,10]
but it does not work.
Easiest would be to do:
df['new date'] = df['date_of_birth'].apply(lambda d: d[0:10])
It just applies the function to take the first 10 characters to each entry.
The answer in the comment section works as well if you flip new date for date_of_birth and is probably faster:
df['new date'] = df['date_of_birth'].str[:10]
Just in case the dates aren't always the first 10 items in the string (e.g 1/1/2023) you can split it based on the space that comes after the date, then take the date:
df['new date'] = df['date_of_birth'].apply(lambda x: x.split(' ')[0])
I have one field in a pandas DataFrame that was imported as string format.
It should be a datetime variable. How do I convert it to a datetime column and then filter based on date.
Example:
df = pd.DataFrame({'date': ['05SEP2014:00:00:00.000']})
Use the to_datetime function, specifying a format to match your data.
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
If you have more than one column to be converted you can do the following:
df[["col1", "col2", "col3"]] = df[["col1", "col2", "col3"]].apply(pd.to_datetime)
You can use the DataFrame method .apply() to operate on the values in Mycol:
>>> df = pd.DataFrame(['05SEP2014:00:00:00.000'],columns=['Mycol'])
>>> df
Mycol
0 05SEP2014:00:00:00.000
>>> import datetime as dt
>>> df['Mycol'] = df['Mycol'].apply(lambda x:
dt.datetime.strptime(x,'%d%b%Y:%H:%M:%S.%f'))
>>> df
Mycol
0 2014-09-05
Use the pandas to_datetime function to parse the column as DateTime. Also, by using infer_datetime_format=True, it will automatically detect the format and convert the mentioned column to DateTime.
import pandas as pd
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], infer_datetime_format=True)
chrisb's answer works:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
however it results in a Python warning of
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I would guess this is due to some chaining indexing.
Time Saver:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'])
To silence SettingWithCopyWarning
If you got this warning, then that means your dataframe was probably created by filtering another dataframe. Make a copy of your dataframe before any assignment and you're good to go.
df = df.copy()
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f')
errors='coerce' is useful
If some rows are not in the correct format or not datetime at all, errors= parameter is very useful, so that you can convert the valid rows and handle the rows that contained invalid values later.
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
# for multiple columns
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime, format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
Setting the correct format= is much faster than letting pandas find out1
Long story short, passing the correct format= from the beginning as in chrisb's post is much faster than letting pandas figure out the format, especially if the format contains time component. The runtime difference for dataframes greater than 10k rows is huge (~25 times faster, so we're talking like a couple minutes vs a few seconds). All valid format options can be found at https://strftime.org/.
1 Code used to produce the timeit test plot.
import perfplot
from random import choices
from datetime import datetime
mdYHMSf = range(1,13), range(1,29), range(2000,2024), range(24), *[range(60)]*2, range(1000)
perfplot.show(
kernels=[lambda x: pd.to_datetime(x),
lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M:%S.%f'),
lambda x: pd.to_datetime(x, infer_datetime_format=True),
lambda s: s.apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))],
labels=["pd.to_datetime(df['date'])",
"pd.to_datetime(df['date'], format='%m/%d/%Y %H:%M:%S.%f')",
"pd.to_datetime(df['date'], infer_datetime_format=True)",
"df['date'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))"],
n_range=[2**k for k in range(20)],
setup=lambda n: pd.Series([f"{m}/{d}/{Y} {H}:{M}:{S}.{f}"
for m,d,Y,H,M,S,f in zip(*[choices(e, k=n) for e in mdYHMSf])]),
equality_check=pd.Series.equals,
xlabel='len(df)'
)
Just like we convert object data type to float or int. Use astype()
raw_data['Mycol']=raw_data['Mycol'].astype('datetime64[ns]')
I have one field in a pandas DataFrame that was imported as string format.
It should be a datetime variable. How do I convert it to a datetime column and then filter based on date.
Example:
df = pd.DataFrame({'date': ['05SEP2014:00:00:00.000']})
Use the to_datetime function, specifying a format to match your data.
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
If you have more than one column to be converted you can do the following:
df[["col1", "col2", "col3"]] = df[["col1", "col2", "col3"]].apply(pd.to_datetime)
You can use the DataFrame method .apply() to operate on the values in Mycol:
>>> df = pd.DataFrame(['05SEP2014:00:00:00.000'],columns=['Mycol'])
>>> df
Mycol
0 05SEP2014:00:00:00.000
>>> import datetime as dt
>>> df['Mycol'] = df['Mycol'].apply(lambda x:
dt.datetime.strptime(x,'%d%b%Y:%H:%M:%S.%f'))
>>> df
Mycol
0 2014-09-05
Use the pandas to_datetime function to parse the column as DateTime. Also, by using infer_datetime_format=True, it will automatically detect the format and convert the mentioned column to DateTime.
import pandas as pd
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], infer_datetime_format=True)
chrisb's answer works:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
however it results in a Python warning of
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I would guess this is due to some chaining indexing.
Time Saver:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'])
To silence SettingWithCopyWarning
If you got this warning, then that means your dataframe was probably created by filtering another dataframe. Make a copy of your dataframe before any assignment and you're good to go.
df = df.copy()
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f')
errors='coerce' is useful
If some rows are not in the correct format or not datetime at all, errors= parameter is very useful, so that you can convert the valid rows and handle the rows that contained invalid values later.
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
# for multiple columns
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime, format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
Setting the correct format= is much faster than letting pandas find out1
Long story short, passing the correct format= from the beginning as in chrisb's post is much faster than letting pandas figure out the format, especially if the format contains time component. The runtime difference for dataframes greater than 10k rows is huge (~25 times faster, so we're talking like a couple minutes vs a few seconds). All valid format options can be found at https://strftime.org/.
1 Code used to produce the timeit test plot.
import perfplot
from random import choices
from datetime import datetime
mdYHMSf = range(1,13), range(1,29), range(2000,2024), range(24), *[range(60)]*2, range(1000)
perfplot.show(
kernels=[lambda x: pd.to_datetime(x),
lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M:%S.%f'),
lambda x: pd.to_datetime(x, infer_datetime_format=True),
lambda s: s.apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))],
labels=["pd.to_datetime(df['date'])",
"pd.to_datetime(df['date'], format='%m/%d/%Y %H:%M:%S.%f')",
"pd.to_datetime(df['date'], infer_datetime_format=True)",
"df['date'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))"],
n_range=[2**k for k in range(20)],
setup=lambda n: pd.Series([f"{m}/{d}/{Y} {H}:{M}:{S}.{f}"
for m,d,Y,H,M,S,f in zip(*[choices(e, k=n) for e in mdYHMSf])]),
equality_check=pd.Series.equals,
xlabel='len(df)'
)
Just like we convert object data type to float or int. Use astype()
raw_data['Mycol']=raw_data['Mycol'].astype('datetime64[ns]')
I would like to concatenate all the columns with comma-delimitted in pandas.
But as you can seem it is very laborious tasks since I manually typed all the column indices.
de = data[3]+","+data[4]+","+data[5]+....+","+data[1511]
do you have any idea to avoid above procedure in pandas in python3?
First convert all columns to strings by DataFrame.astype and then possible add join per rows:
df = data.astype(str).apply(','.join, axis=1)
Or after convert to strings add ,, then sum and last remove last , by Series.str.rstrip:
df = data.astype(str).add(',').sum(axis=1).str.rstrip(',')
I have one field in a pandas DataFrame that was imported as string format.
It should be a datetime variable. How do I convert it to a datetime column and then filter based on date.
Example:
df = pd.DataFrame({'date': ['05SEP2014:00:00:00.000']})
Use the to_datetime function, specifying a format to match your data.
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
If you have more than one column to be converted you can do the following:
df[["col1", "col2", "col3"]] = df[["col1", "col2", "col3"]].apply(pd.to_datetime)
You can use the DataFrame method .apply() to operate on the values in Mycol:
>>> df = pd.DataFrame(['05SEP2014:00:00:00.000'],columns=['Mycol'])
>>> df
Mycol
0 05SEP2014:00:00:00.000
>>> import datetime as dt
>>> df['Mycol'] = df['Mycol'].apply(lambda x:
dt.datetime.strptime(x,'%d%b%Y:%H:%M:%S.%f'))
>>> df
Mycol
0 2014-09-05
Use the pandas to_datetime function to parse the column as DateTime. Also, by using infer_datetime_format=True, it will automatically detect the format and convert the mentioned column to DateTime.
import pandas as pd
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], infer_datetime_format=True)
chrisb's answer works:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
however it results in a Python warning of
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I would guess this is due to some chaining indexing.
Time Saver:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'])
To silence SettingWithCopyWarning
If you got this warning, then that means your dataframe was probably created by filtering another dataframe. Make a copy of your dataframe before any assignment and you're good to go.
df = df.copy()
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f')
errors='coerce' is useful
If some rows are not in the correct format or not datetime at all, errors= parameter is very useful, so that you can convert the valid rows and handle the rows that contained invalid values later.
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
# for multiple columns
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime, format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
Setting the correct format= is much faster than letting pandas find out1
Long story short, passing the correct format= from the beginning as in chrisb's post is much faster than letting pandas figure out the format, especially if the format contains time component. The runtime difference for dataframes greater than 10k rows is huge (~25 times faster, so we're talking like a couple minutes vs a few seconds). All valid format options can be found at https://strftime.org/.
1 Code used to produce the timeit test plot.
import perfplot
from random import choices
from datetime import datetime
mdYHMSf = range(1,13), range(1,29), range(2000,2024), range(24), *[range(60)]*2, range(1000)
perfplot.show(
kernels=[lambda x: pd.to_datetime(x),
lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M:%S.%f'),
lambda x: pd.to_datetime(x, infer_datetime_format=True),
lambda s: s.apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))],
labels=["pd.to_datetime(df['date'])",
"pd.to_datetime(df['date'], format='%m/%d/%Y %H:%M:%S.%f')",
"pd.to_datetime(df['date'], infer_datetime_format=True)",
"df['date'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))"],
n_range=[2**k for k in range(20)],
setup=lambda n: pd.Series([f"{m}/{d}/{Y} {H}:{M}:{S}.{f}"
for m,d,Y,H,M,S,f in zip(*[choices(e, k=n) for e in mdYHMSf])]),
equality_check=pd.Series.equals,
xlabel='len(df)'
)
Just like we convert object data type to float or int. Use astype()
raw_data['Mycol']=raw_data['Mycol'].astype('datetime64[ns]')