Using pandas on a CSV file, how do I filter a data set by "month" if the "date" column is in format "MM/DD/YYYY"

I have a CSV file with a column "date" whose values are formatted MM/DD/YYYY. Is there a way to filter the data in this file by month alone, using pandas in Python?
### csv file ###
___, Date, ...
12/4/2003
6/15/2012
#################
data = pd.read_csv("file.csv")
# how do i do this line?
is_data_july = data["date"].onlyCheckFirstChar == "6"
Thanks

You might want to have a look at pd.to_datetime.
df = pd.read_csv("file.csv")
df['date'] = pd.to_datetime(df['date'], ...)
mask = df['date'].dt.month == 6
df.loc[mask].to_csv("newfile.csv")
In fact, pd.read_csv has a shortcut for this (if the default options in pd.to_datetime work for you):
df = pd.read_csv("file.csv", parse_dates=['date'])
mask = df['date'].dt.month == 6
df.loc[mask].to_csv("newfile.csv")
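As a minimal sketch of the approach above, the snippet below passes an explicit format so that MM/DD/YYYY is never misread as DD/MM/YYYY. The inline CSV text and the "id" column are assumptions standing in for file.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for file.csv (the "id" column is invented for illustration)
csv_text = "id,date\n1,12/4/2003\n2,6/15/2012\n3,6/1/2020\n"

df = pd.read_csv(io.StringIO(csv_text))

# Parse the strings as month/day/year explicitly
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# Keep only the rows whose month is June (month number 6)
june = df.loc[df["date"].dt.month == 6]
print(june)
```

Note that month 6 is June; use 7 if you want July.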

Related

Allowing Python to import a CSV with duplicate column names

I have a data frame with 109 columns in total.
When I import the data using read_csv, it adds ".1", ".2" to duplicate names.
Is there any way around it?
I have tried this:
df = pd.read_csv(r'C:\Users\agns1\Downloads\treatment1.csv',encoding = "ISO-8859-1",
sep='|', header=None)
df = df.rename(columns=df.iloc[0], copy=False).iloc[1:].reset_index(drop=True)
but it changed the data frame and wasn't helpful.

Remove header=None, because it prevents the first row of the file from being used as df.columns. Then strip the trailing dot and digits from the column names:
df = pd.read_csv(r'C:\Users\agns1\Downloads\treatment1.csv',encoding="ISO-8859-1", sep=',')
# raw string plus regex=True, so the pattern is treated as a regular expression
df.columns = df.columns.str.replace(r'\.\d+$', '', regex=True)
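A small self-contained sketch of the suffix-stripping step, using invented inline data instead of treatment1.csv. One caveat worth keeping in mind: a column whose real name ends in a dot followed by digits would be stripped as well.

```python
import io
import pandas as pd

# Hypothetical data with the column name "a" repeated three times
csv_text = "a,b,a,a\n1,2,3,4\n"

df = pd.read_csv(io.StringIO(csv_text))
mangled = list(df.columns)  # read_csv has renamed the repeats to a.1, a.2

# Remove a trailing ".<digits>" from every column name
df.columns = df.columns.str.replace(r"\.\d+$", "", regex=True)
restored = list(df.columns)

print(mangled)
print(restored)
```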

How to put the first value in one column and the remaining values into another column?

ROCO2_CLEF_00001.jpg,C3277934,C0002978
ROCO2_CLEF_00002.jpg,C3265939,C0002942,C2357569
I want to make a pandas data frame from a CSV file.
I want to put the first entry of each row (the filename) into a column named "filenames", and the remaining entries into another column named "class". How can I do that?

If your file doesn't have a fixed number of commas per row, you could do the following:
import pandas as pd
csv_path = 'test_csv.csv'
# read the raw lines; the with-block closes the file afterwards
with open(csv_path) as f:
    raw_data = f.readlines()
# clean rows
raw_data = [x.strip().replace("'", "") for x in raw_data]
print(raw_data)
# make split between data
raw_data = [ [x.split(",")[0], ','.join(x.split(",")[1:])] for x in raw_data]
print(raw_data)
# build the pandas Dataframe
column_names = ["filenames", "class"]
temp_df = pd.DataFrame(data=raw_data, columns=column_names)
print(temp_df)
filenames class
0 ROCO2_CLEF_00001.jpg C3277934,C0002978
1 ROCO2_CLEF_00002.jpg C3265939,C0002942,C2357569
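The same split can also be sketched with pandas string methods: splitting each line at the first comma only (n=1) leaves the remaining codes joined in one string. The two lines below are the sample rows from the question, held in a list rather than read from a file.

```python
import pandas as pd

# Sample rows from the question, as raw text lines
lines = [
    "ROCO2_CLEF_00001.jpg,C3277934,C0002978",
    "ROCO2_CLEF_00002.jpg,C3265939,C0002942,C2357569",
]

# Split at the first comma only, expanding the two pieces into two columns
parts = pd.Series(lines).str.split(",", n=1, expand=True)
parts.columns = ["filenames", "class"]
print(parts)
```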

How to merge two columns into one date column with pandas?

I have a CSV file in which the first column is the date and the fifth is the hour.
I would like to merge them into a single column with a specific format in order to write another CSV file.
This is basically the file:
DATE,DAY.WEEK,DUMMY.WEEKENDS.HOLIDAYS,DUMMY.MONDAY,HOUR
01/01/2015,5,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,2,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,3,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,4,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,5,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,6,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,7,1,0,0,0,0,0,0,0,0,0,0,0
01/01/2015,5,1,0,8,1,0,0,0,0,0,0,0,0,0,0,0
I have tried to read the dataframe as
dataR = pd.read_csv(fnamecsv)
and convert the first line to date, as:
date_dt3 = datetime.strptime(dataR["DATE"].iloc[0], '%d/%m/%Y')
However, this does not seem to be the correct way, for two reasons:
1) it sets the hour without considering the HOUR column;
2) it does not seem to use any pandas features.
Thanks for any kind of help,
Diedro

Using the + operator
You need to convert the data frame elements to strings before joining them. You can also use a different separator in the join, e.g. a dash, an underscore or a space.
import pandas as pd
df = pd.DataFrame({'Last': ['something', 'you', 'want'],
'First': ['merge', 'with', 'this']})
print('Before Join')
print(df, '\n')
print('After join')
df['Name'] = df["First"].astype(str) + " " + df["Last"]
print(df)

You can use read_csv with the parse_dates parameter, given a nested list of both column names, and date_parser to specify the format:
f = lambda x: pd.to_datetime(x, format='%d/%m/%Y %H')
dataR = pd.read_csv(fnamecsv, parse_dates=[['DATE','HOUR']], date_parser=f)
Or convert the hours to timedeltas and add them to the datetimes afterwards:
dataR = pd.read_csv(fnamecsv, parse_dates=[0], dayfirst=True)
# lowercase 'h' is the unit alias for hours
dataR['DATE'] += pd.to_timedelta(dataR.pop('HOUR'), unit='h')
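Since date_parser and nested parse_dates lists are deprecated in recent pandas releases, here is a sketch of the timedelta variant that parses after loading instead. The inline CSV is a trimmed, hypothetical version of the question's file (only DATE and HOUR), and the DATETIME column name and output format are assumptions.

```python
import io
import pandas as pd

# Hypothetical, trimmed version of the file: only the DATE and HOUR columns
csv_text = "DATE,HOUR\n01/01/2015,1\n01/01/2015,2\n02/01/2015,24\n"
dataR = pd.read_csv(io.StringIO(csv_text))

# Day-first date plus the hour as an offset; hour 24 rolls over to the next day
dataR["DATETIME"] = (
    pd.to_datetime(dataR["DATE"], format="%d/%m/%Y")
    + pd.to_timedelta(dataR["HOUR"], unit="h")
)

# Render in whatever text format the output CSV needs (this one is an example)
formatted = dataR["DATETIME"].dt.strftime("%Y-%m-%d %H:%M")
print(formatted)
```

Adding the hour as an offset also copes with an HOUR value of 24, which a '%H' format string would reject.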

How to assign column variable 'Date' with date value from file name (Pandas)

I have the following file name....
Filename = ('../BSOS Supplier Sales (01289), 02.04.2018 - 08.04.2018 (X).xlsx')
I want to
1) read the file into a df, and
2) assign a new column "Date" with the date range captured in the above filename (02.04.2018 - 08.04.2018).
How can this be done using pd.read_excel(Filename)?

You could read the content to a DataFrame
df = pd.read_excel(Filename)
Now extract the date with a regular expression
import re
date = re.compile(r'([\.\d]+ - [\.\d]+)').search(Filename).groups()[0]
And add to the DataFrame a new column with it
df['Date'] = date
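If real dates are more useful than the raw text, the two ends of the range can be parsed separately. This sketch assumes the filename uses day.month.year order; the small DataFrame is a hypothetical stand-in for the result of pd.read_excel(Filename), and the Date_start and Date_end column names are invented.

```python
import re
import pandas as pd

Filename = '../BSOS Supplier Sales (01289), 02.04.2018 - 08.04.2018 (X).xlsx'

# Capture the two dd.mm.yyyy dates on either side of " - "
match = re.search(r'(\d{2}\.\d{2}\.\d{4}) - (\d{2}\.\d{2}\.\d{4})', Filename)
start, end = (pd.to_datetime(s, format='%d.%m.%Y') for s in match.groups())

# Hypothetical stand-in for df = pd.read_excel(Filename)
df = pd.DataFrame({'sales': [10, 20]})

# A scalar assigned to a column is broadcast to every row
df['Date_start'] = start
df['Date_end'] = end
print(df)
```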

Exclude the last two rows when importing a CSV file using read_csv in pandas

Afternoon All,
I am extracting data from SQL server to a csv format then reading the file in.
df = pd.read_csv(
'TKY_RFQs.csv',
sep='~',
usecols=[
0,1,2,3,4,5,6,7,8,9,
10,11,12,13,14,15,16,17,18,19,
20,21,22,23,24,25,26,27,28,29,
30,31,32,33,34,35,36,37
]
)
There is a blank row and then the record count at the end of the file, both of which I would like to remove.
I have been getting around the issue via this code but would like to resolve the root problem:
# Count_Row=df.shape[0] # gives number of row count
# df_Sample = df[['trading_book','state', 'rfq_num_of_dealers']].head(Count_Row-1)
Is there a way to exclude the last two rows in the file, or alternatively to remove any row which has null values for all columns?
Pete

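You could slice off the trailing rows after reading. Note that read_csv skips completely blank lines by default, so if only the record-count row ends up in the data frame, [:-1] is enough: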
df = pd.read_csv(
'TKY_RFQs.csv',
sep='~',
usecols=[
0,1,2,3,4,5,6,7,8,9,
10,11,12,13,14,15,16,17,18,19,
20,21,22,23,24,25,26,27,28,29,
30,31,32,33,34,35,36,37
]
)[:-2]
Example:
from pandas import read_csv
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(url, names=names)[:-2] #to exclude last two rows
#data = read_csv(url, names=names) #to include all rows
print(data)
#description = data.describe()

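You can use skipfooter directly in read_csv (skiprows only counts lines from the top of the file and does not accept a negative value):
df = pd.read_csv(
'TKY_RFQs.csv',
sep='~',
usecols=[
0,1,2,3,4,5,6,7,8,9,
10,11,12,13,14,15,16,17,18,19,
20,21,22,23,24,25,26,27,28,29,
30,31,32,33,34,35,36,37
],
skipfooter=2, # skip the last two lines of the file when reading
engine='python' # skipfooter is not supported by the default C engine
)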
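If you would rather not depend on parser options, one sketch is to trim the trailing lines yourself before handing the text to read_csv. The inline text below is a hypothetical two-column stand-in for TKY_RFQs.csv, ending with a blank line and a record count.

```python
import io
import pandas as pd

# Hypothetical file content: header, two records, a blank line, then the record count
raw_text = "trading_book~state\nTKY1~open\nTKY2~closed\n\n2\n"

# Drop the last two lines (the blank line and the count) before parsing
lines = raw_text.splitlines()[:-2]

df = pd.read_csv(io.StringIO("\n".join(lines)), sep="~")
print(df)
```

For a real file, the text would come from reading 'TKY_RFQs.csv' instead of the inline string.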
