How do I combine columns (day, month, year) of a CSV file into one new date column and then delete them afterwards? - pandas

I am trying to combine three columns from a CSV file which happen to be 'day', 'month', and 'year'. I want to combine them into one 'date' column and then delete them. However, I keep receiving error messages. This is what I have so far:
import pandas as pd

def airport_data(cur_file, airport_code):
    # import data
    airportsearch = pd.read_csv('/work/Data.airports.csv')
    # drop all rows of data that do not belong to the desired airport code
    airportsearch.dropna([lambda x:
        (x['origin_airport'].isin(['SFO'])) &
        (x)['destination_airport'].isin(['SFO'])]
    # make columns lower case
    airportsearch.columns = airportsearch.columns.str.lower()
    # create new column
    df = pd.DataFrame(airportsearch, columns=["year", "month", "day"])
    df.dates = pd.to_datetime(df.Year)
    # drop old columns
    airportsearch.drop(columns=['day', 'month', 'year'], axis=1)
The error that I am getting reads: 'DataFrame' object has no attribute 'Year'. I am unsure what this even means. Any help would be appreciated! Here is the report of the error in case that helps.

The error message you have linked says that "Year" is not a valid attribute of your DataFrame. Python is case-sensitive, and from your code it seems you've named that column "year".
Also, please copy and paste error messages as text rather than posting screenshots on Stack Overflow.
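To answer the underlying question: pd.to_datetime accepts a DataFrame whose columns are named year, month, and day, which makes the combine-then-drop step short. A minimal sketch, assuming the lower-cased columns exist after the str.lower() step (the sample values here are made up):

import pandas as pd

# stand-in for the CSV data (hypothetical values)
airportsearch = pd.DataFrame({'year': [2019, 2020],
                              'month': [5, 12],
                              'day': [1, 14]})

# to_datetime understands a frame with year/month/day columns
airportsearch['date'] = pd.to_datetime(airportsearch[['year', 'month', 'day']])

# drop the old columns; note that drop returns a new frame, so assign it back
airportsearch = airportsearch.drop(columns=['year', 'month', 'day'])
print(airportsearch)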

Related

Pandas splitting a column with new line separator

I am extracting tables from pdf using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True) and that splits it into two columns, but I want the new column names to be A and B, not 0 and 1. Also, I need to pass a generalized column label instead of the actual column name, since I need to implement this for several docs which may have different column names. I can determine such a column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However when I pass colNew in split function, it throws an attribute error
df[colNew].str.split('\n', 2, expand=True)
AttributeError: DataFrame object has no attribute 'str'
You can take advantage of pandas' str.split function.
import pandas as pd

# recreate the example column above
df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})

# first: make sure the column is of string type
# second: split the column on the separator '\n'
# third: pass expand=True so the split pieces become two new columns
test = df['A\nB'].astype('str').str.split('\n', expand=True)

# rename the resulting columns
test.columns = ['A', 'B']
I hope this is helpful.
I reproduced the error on my side... I guess the issue is that df[colNew] is still a DataFrame, because colNew is an Index of labels rather than a single label.
But .str.split() only works on a Series. So, taking your code as an example, I would convert the DataFrame to a Series using iloc[:, 0].
Then another line to split the column headers:
df2 = df[colNew].iloc[:, 0].str.split('\n', n=2, expand=True)
df2.columns = 'A\nB'.split('\n')
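Putting both answers together, a generalized sketch (assuming exactly one header contains a newline) can detect the merged column, split it, and reuse the pieces of the old header as the new column names:

import pandas as pd

df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})

# find the merged column by looking for a newline in the header
col = df.columns[df.columns.str.contains('\n')][0]

# split the values into separate columns
parts = df[col].str.split('\n', expand=True)

# name the new columns after the pieces of the old header
parts.columns = col.split('\n')

# replace the merged column with the split ones
df = df.drop(columns=[col]).join(parts)
print(df)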

Pyspark: how to fix 'could not parse datatype: interval' error

I'm trying to add a new column to a pyspark df by subtracting the values of two existing columns.
I already had a date_of_birth column available, so I inserted a current_date column with the following code:
import datetime
from pyspark.sql.functions import lit

currentdate = "14-12-2021"
day, month, year = currentdate.split('-')
today = datetime.date(int(year), int(month), int(day))
df = df.withColumn("current_date", lit(today))
Displaying my df confirms that it worked. Looks a little something like this:
|id|date_of_birth|current_date|
|-|-|-|
|01|1995-01-01|2021-12-14|
|02|1987-02-16|2021-12-14|
I inserted the age column by subtracting the values of date_of_birth and current_date.
df = df.withColumn('age', (df['current_date'] - df['date_of_birth']))
Cell runs without a problem.
Here's where I'm stuck:
Once I try to display my dataframe again in order to verify that everything went smoothly, the following error occurs:
'could not parse datatype: interval'
I used df.dtypes to check what's happening, and apparently my newly inserted age column is of interval type.
How can I fix this?
Is there a way to display the age in years (int) in this particular scenario?
PS: both the date_of_birth and current_date cols have a date type.
Solved it. Mike's comment helped tons. Thank you!
Here's how I solved it:
from pyspark.sql.functions import lit
import pyspark.sql.functions as f

# insert new column current_date with dummy data (in this case, 1s)
df = df.withColumn("current_date", lit(1))
# update the column with the current_date() function
df = df.withColumn("current_date", f.current_date())
# insert new column age with dummy data (in this case, 1s)
df = df.withColumn("age", lit(1))
# update the column with months_between(); divide by 12 to obtain years
df = df.withColumn("age", f.months_between(df.current_date, df.date_of_birth) / 12)
# round and cast to integer to get rid of decimals
df = df.withColumn("age", f.round(df["age"]).cast('integer'))
I would use one of the pyspark functions for calculating the difference between dates:
pyspark.sql.functions.datediff
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.datediff.html
pyspark.sql.functions.months_between
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.months_between.html
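For example, a minimal sketch of the months_between route, assuming date_of_birth is already a date column:

import pyspark.sql.functions as f

# whole years between today and date_of_birth
df = df.withColumn(
    'age',
    f.floor(f.months_between(f.current_date(), f.col('date_of_birth')) / 12)
)

Using floor avoids rounding someone up to a birthday they haven't reached yet, which f.round would do.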

How to delete two or more columns in csv using pandas

I have a csv file with 4 columns (Name, User_Name, Phone#, Email). I want to delete the rows that have no value in both Phone# and Email. If there is no value in the Phone# column but some value in the Email column, or vice versa, I don't want to delete that row. I hope you get what I want.
Sorry I don't have the code.
Thanks in advance
You can use the pandas notna() function to get a boolean series indicating which values are not missing. You can call this on both the email and the phone column and combine it with boolean | to get a truth series indicating that at least one of the email and phone columns is not missing. Then, you can use this series as a mask to filter the right rows.
import pandas as pd
# Import .csv file
df = pd.read_csv('mypath/myfile.csv')
# Filter to get rows where columns 'Email' and 'Phone#' are not both None
new_df = df[df['Email'].notna() | df['Phone#'].notna()]
# Write pandas df to disk
new_df.to_csv('mypath/mynewfile.csv', index=False)
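An equivalent one-liner is dropna with how='all' restricted to those two columns, which drops a row only when both are missing:

new_df = df.dropna(subset=['Phone#', 'Email'], how='all')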

pandas to_datetime leaves unconverted data

I'm trying to convert a column with strings that look like "201905011" (year/month/day) to datetime, ideally displayed as 05-01-2019 (month/day/year). I'm currently trying the following, but it's not working for me.
pd.to_datetime(data.datetime, format = '%Y%m%d%H')
This leaves me with the error: "ValueError: unconverted data remains: 4"
I would like to know instead how to correctly do this.
I created an example based on ALollz's comment. The dataframe below has a correct value in the first row, while the second row has an extra 0 at the end. Converting with errors='coerce' and filtering on the null results returns the rows in which the data doesn't match the specified format.
import pandas as pd
df = pd.DataFrame({"datefield":["201901010","20190101010"]})
df.loc[pd.to_datetime(df.datefield, format='%Y%m%d%H', errors='coerce').isnull(), 'datefield']
1 20190101010
Name: datefield, dtype: object
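Once the unparseable rows are found (or coerced to NaT), the conversion the question asks for, including the month-day-year display, might look like this sketch (assuming the strings really do end in an hour, as the %H in the format suggests):

import pandas as pd

df = pd.DataFrame({'datefield': ['2019050112', '20190101010']})

# coerce values that don't match the format to NaT instead of raising
dates = pd.to_datetime(df.datefield, format='%Y%m%d%H', errors='coerce')

# render as month-day-year strings, e.g. 05-01-2019
df['date_str'] = dates.dt.strftime('%m-%d-%Y')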

Data slicing a pandas frame - I'm having problems with unique

I am facing issues trying to select a subset of columns and running unique on it.
Source Data:
df_raw = pd.read_csv('data/master.csv', nrows=10000)
df_raw.shape
Produces:
(10000, 86)
Process Data:
df = df_raw[['A','B','C']]
df.shape
Produces:
(10000, 3)
Furthermore, doing:
df_raw.head()
df.head()
produces a correct list of rows and columns.
However,
print('RAW:',sorted(df_raw['A'].unique()))
works perfectly
Whilst:
print('PROCESSED:',sorted(df['A'].unique()))
produces:
AttributeError: 'DataFrame' object has no attribute 'unique'
What am I doing wrong? If the shape and head output are exactly what I want, I'm confused why my processed dataset is throwing errors. I did read Pandas 'DataFrame' object has no attribute 'unique' on SO, which correctly states that unique needs to be applied to columns, which is what I am doing.
This was a case of a duplicate column. Since this is proprietary data, I abstracted it as 'A', 'B', 'C' in this question, which masked the problem. (The real data set had 86 columns, and I had included one of those columns twice in my subset, then tried to run unique on it.)
My problem was this:
df_raw = pd.read_csv('data/master.csv', nrows=10000)
df = df_raw[['A','B','C', 'A']]  # <-- I did not realize until later that I had duplicated A.
This was causing problems when doing a unique on 'A'.
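A quick way to guard against this is to de-duplicate the column index before selecting; a sketch:

import pandas as pd

df = pd.DataFrame([[1, 2, 3, 1]], columns=['A', 'B', 'C', 'A'])

# keep only the first occurrence of each column label
df = df.loc[:, ~df.columns.duplicated()]

# df['A'] is a Series again, so unique() works
print(df['A'].unique())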
To extract a subset of data from the entire dataframe based on a column ID, this works:
df = df.drop_duplicates(subset=['Id'])  # where 'Id' is the column used to filter
print(df)