Using pandas on a CSV file, how do I filter a data set by "month" if the "date" column is in format "MM/DD/YYYY" - pandas

I have a CSV file with a column "date" whose values are formatted MM/DD/YYYY. I was wondering if there was a way I could filter the data in this file based on just month using pandas in python.
### csv file ###
___, Date, ...
12/4/2003
6/15/2012
#################
data = pd.read_csv("file.csv")
# how do i do this line?
is_data_july = data["date"].onlyCheckFirstChar == "6"
Thanks

You might want to have a look at pd.to_datetime.
df = pd.read_csv("file.csv")
df['date'] = pd.to_datetime(df['date'], ...)
mask = df['date'].dt.month == 6
df.loc[mask].to_csv("newfile.csv")
In fact, pd.read_csv has a shortcut for this (if the default options in pd.to_datetime work for you):
df = pd.read_csv("file.csv", parse_dates=['date'])
mask = df['date'].dt.month == 6
df.loc[mask].to_csv("newfile.csv")
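If the default parsing guesses the wrong day/month order, you can pass the format explicitly. A minimal sketch, assuming the column is named "date" as in the question; the inline sample data and the "id" column are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical sample standing in for file.csv
csv_text = "id,date\n1,12/4/2003\n2,6/15/2012\n3,6/1/2020\n"

df = pd.read_csv(io.StringIO(csv_text))
# An explicit MM/DD/YYYY format avoids ambiguity for values like 6/1/2020
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# Keep only the rows whose month is June
june = df.loc[df["date"].dt.month == 6]
print(june)
```

The same mask works for any month number from 1 to 12.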

Related

Allowing Python to import a CSV with duplicate column names

I have a data frame with 109 columns in total.
When I import the data using read_csv, it appends ".1", ".2" to duplicate column names.
Is there any way around this?
I have tried this:
df = pd.read_csv(r'C:\Users\agns1\Downloads\treatment1.csv',encoding = "ISO-8859-1",
sep='|', header=None)
df = df.rename(columns=df.iloc[0], copy=False).iloc[1:].reset_index(drop=True)
But it changed the data frame and wasn't helpful.
Remove header=None, because it prevents the first row of the file from being used as df.columns. Then strip the trailing dot and digits from the column names:
df = pd.read_csv(r'C:\Users\agns1\Downloads\treatment1.csv',encoding="ISO-8859-1", sep=',')
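# raw string plus regex=True: since pandas 2.0, str.replace treats the pattern literally by default
df.columns = df.columns.str.replace(r'\.\d+$', '', regex=True)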
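Here is a small self-contained sketch of the suffix stripping, using a made-up four-column CSV instead of treatment1.csv:

```python
import io
import pandas as pd

# Hypothetical CSV with duplicated header names
csv_text = "a,b,a,b\n1,2,3,4\n"

df = pd.read_csv(io.StringIO(csv_text))
# At this point read_csv has renamed the duplicates to 'a.1' and 'b.1'

# Strip a trailing ".<digits>" suffix from every column name
df.columns = df.columns.str.replace(r"\.\d+$", "", regex=True)
print(list(df.columns))
```

Be aware that this would also strip the suffix from a column whose real name ends in a dot and digits, and that selecting df["a"] afterwards returns both "a" columns.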

how to put first value in one column and remaining into other column?

ROCO2_CLEF_00001.jpg,C3277934,C0002978
ROCO2_CLEF_00002.jpg,C3265939,C0002942,C2357569
I want to make a pandas data frame from csv file.
I want to put the first entry of each row (the filename) into a column named "filenames", and the remaining entries into another column named "class". How can I do this?
If your file doesn't have a fixed number of commas per row, you could do the following:
import pandas as pd
csv_path = 'test_csv.csv'
raw_data = open(csv_path).readlines()
# clean rows
raw_data = [x.strip().replace("'", "") for x in raw_data]
print(raw_data)
# make split between data
raw_data = [ [x.split(",")[0], ','.join(x.split(",")[1:])] for x in raw_data]
print(raw_data)
# build the pandas Dataframe
column_names = ["filenames", "class"]
temp_df = pd.DataFrame(data=raw_data, columns=column_names)
print(temp_df)
filenames class
0 ROCO2_CLEF_00001.jpg C3277934,C0002978
1 ROCO2_CLEF_00002.jpg C3265939,C0002942,C2357569
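A slightly shorter variant of the same idea is to split each line only at its first comma. This is a sketch that uses the two sample rows from the question as an in-memory list rather than reading test_csv.csv:

```python
import pandas as pd

# The two sample rows from the question
lines = [
    "ROCO2_CLEF_00001.jpg,C3277934,C0002978",
    "ROCO2_CLEF_00002.jpg,C3265939,C0002942,C2357569",
]

# maxsplit=1 splits only at the first comma, so the rest stays in one piece
rows = [line.strip().split(",", 1) for line in lines]

df = pd.DataFrame(rows, columns=["filenames", "class"])
print(df)
```

This assumes every line contains at least one comma; a line without one would produce a single-element row and the DataFrame construction would fail.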

how to merge two columns in one column as date with pandas?

I have a csv with the first column the date and the 5th the hours.
I would like to merge them in a single column with a specific format in order to write another csv file.
This is basically the file:
DATE,DAY.WEEK,DUMMY.WEEKENDS.HOLIDAYS,DUMMY.MONDAY,HOUR
01/01/2015,5,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,2,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,3,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,4,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,5,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,6,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,7,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,8,1,0,0,0,0,0,0,0,0,0,0,0
I have tried to read the dataframe as
dataR = pd.read_csv(fnamecsv)
and convert the first line to date, as:
date_dt3 = datetime.strptime(dataR["DATE"].iloc[0], '%d/%m/%Y')
However, this does not seem to be the correct way, for two reasons:
1) it sets the hour without considering the HOUR column;
2) it does not seem to use pandas features.
Thanks for any kind of help,
Diedro
Using + operator
You need to convert the data frame elements to strings before joining. You can also use a different separator in the join, e.g. a dash, underscore, or space.
import pandas as pd
df = pd.DataFrame({'Last': ['something', 'you', 'want'],
'First': ['merge', 'with', 'this']})
print('Before Join')
print(df, '\n')
print('After join')
df['Name']= df["First"].astype(str) +" "+ df["Last"]
print(df)
You can use read_csv with the parse_dates parameter set to a nested list of both column names, and date_parser to specify the format (both are deprecated in pandas 2.x):
f = lambda x: pd.to_datetime(x, format='%d/%m/%Y %H')
dataR = pd.read_csv(fnamecsv, parse_dates=[['DATE','HOUR']], date_parser=f)
Or convert hours to timedeltas and add to datetimes later:
dataR = pd.read_csv(fnamecsv, parse_dates=[0], dayfirst=True)
dataR['DATE'] += pd.to_timedelta(dataR.pop('HOUR'), unit='h')
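The timedelta approach can also be written without any parsing options in read_csv. A minimal sketch, assuming day-first dates and an integer HOUR column; the reduced two-column sample, the DATETIME column name, and the output format are assumptions, since the question does not say which format is wanted:

```python
import io
import pandas as pd

# Hypothetical reduced sample with only the two relevant columns
csv_text = "DATE,HOUR\n01/01/2015,1\n01/01/2015,2\n02/01/2015,24\n"

dataR = pd.read_csv(io.StringIO(csv_text))

# Parse the day-first dates, then add the hour as a time offset
dates = pd.to_datetime(dataR["DATE"], format="%d/%m/%Y")
dataR["DATETIME"] = dates + pd.to_timedelta(dataR["HOUR"], unit="h")

# Render in an example output format before writing a new CSV
out = dataR["DATETIME"].dt.strftime("%Y-%m-%d %H:%M")
print(out.tolist())
```

Adding a timedelta means an hour value of 24 rolls over to midnight of the next day, which a "%H" format string would reject.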

How to assign column variable 'Date' with date value from file name (Pandas)

I have the following file name....
Filename = ('../BSOS Supplier Sales (01289), 02.04.2018 - 08.04.2018 (X).xlsx')
I want to
1) Read the file into a df and
2) assign a new Column variable "Date" with the date captured in the above filename (02.04.2018 - 08.04.2018)
How can this be done using pd.read_excel(Filename)?
You could read the content to a DataFrame
df = pd.read_excel(Filename)
Now extract the date with a regular expression
import re
date = re.compile(r'([\.\d]+ - [\.\d]+)').search(Filename).groups()[0]
And add to the DataFrame a new column with it
df['Date'] = date
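If you want real datetime values rather than the raw string, you could capture the two dates separately and parse them. This is only a sketch: it assumes the dates are in DD.MM.YYYY order, the StartDate and EndDate column names are made up, and a small stand-in frame replaces the pd.read_excel call so the snippet runs without the file:

```python
import re
import pandas as pd

Filename = "../BSOS Supplier Sales (01289), 02.04.2018 - 08.04.2018 (X).xlsx"

# Capture the start and end dates as two separate groups
match = re.search(r"(\d{2}\.\d{2}\.\d{4}) - (\d{2}\.\d{2}\.\d{4})", Filename)
start, end = (pd.to_datetime(s, format="%d.%m.%Y") for s in match.groups())

# Stand-in for df = pd.read_excel(Filename)
df = pd.DataFrame({"sales": [10, 20]})

df["Date"] = f"{match.group(1)} - {match.group(2)}"  # raw text, as in the answer
df["StartDate"] = start  # hypothetical extra columns holding parsed dates
df["EndDate"] = end
print(df)
```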

Exclude last two rows when import a csv file using read_csv in Pandas

Afternoon All,
I am extracting data from SQL server to a csv format then reading the file in.
df = pd.read_csv(
'TKY_RFQs.csv',
sep='~',
usecols=[
0,1,2,3,4,5,6,7,8,9,
10,11,12,13,14,15,16,17,18,19,
20,21,22,23,24,25,26,27,28,29,
30,31,32,33,34,35,36,37
]
)
There is a blank row then the record count at the end of the file which I would like to remove.
I have been getting around the issue via this code but would like to resolve the root problem:
# Count_Row=df.shape[0] # gives number of row count
# df_Sample = df[['trading_book','state', 'rfq_num_of_dealers']].head(Count_Row-1)
Is there a way to exclude the last two rows in the file, or alternatively remove any row which has null values for all columns?
Pete
Could you try:
df = pd.read_csv(
'TKY_RFQs.csv',
sep='~',
usecols=[
0,1,2,3,4,5,6,7,8,9,
10,11,12,13,14,15,16,17,18,19,
20,21,22,23,24,25,26,27,28,29,
30,31,32,33,34,35,36,37
]
)[:-2]
Example:
from pandas import read_csv
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(url, names=names)[:-2] #to exclude last two rows
#data = read_csv(url, names=names) #to include all rows
print(data)
#description = data.describe()
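You can make use of skipfooter directly in read_csv (skiprows only skips lines from the start of the file and does not accept negative values):
df = pd.read_csv(
'TKY_RFQs.csv',
sep='~',
usecols=[
0,1,2,3,4,5,6,7,8,9,
10,11,12,13,14,15,16,17,18,19,
20,21,22,23,24,25,26,27,28,29,
30,31,32,33,34,35,36,37
],
skipfooter=2, # skip the last two lines of the file
engine='python' # skipfooter is only supported by the Python engine
)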
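To see the footer handling in isolation, here is a small sketch on made-up data with the same layout as described in the question: a blank line followed by a record-count line. Passing skip_blank_lines=False is my own choice here, so that the blank line is explicitly counted as one of the two footer lines:

```python
import io
import pandas as pd

# Hypothetical export: two data rows, a blank line, then a record count
csv_text = "a~b\n1~2\n3~4\n\n2 rows\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    sep="~",
    skipfooter=2,            # drop the blank line and the count line
    engine="python",         # skipfooter needs the Python engine
    skip_blank_lines=False,  # keep blank lines so the footer count is explicit
)
print(df)
```

With skip_blank_lines=False, any blank lines in the middle of the file show up as all-NaN rows; those can be removed afterwards with df.dropna(how="all").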