How can I create a new column in an existing dataframe, fill it with na and specify its dtype to be int64?

I have a big dataframe, and I would like to create an extra column, fill it with NA, and specify the dtype of the column to be int64. How would I do this?
e.g.
dataframe:
col1 col2
5 's'
7 'g'
6 'f'
Let's say I want to add a new column called new_col and populate it with NA and specify the dtype to be int64.
I tried something like:
df['new_col'].fillna().dtype('int64')
But this doesn't seem to work.
The desired output is:
col1 col2 new_col
5 's' na
7 'g' na
6 'f' na
I can't show the desired dtype of new_col but I would like it to be int64.

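import pandas as pd
# NaN cannot be stored in a plain int64 column (astype('int') would leave it as float64),
# so use pandas' nullable integer dtype 'Int64' (capital I) instead
df['new_col'] = pd.Series(pd.NA, index=df.index, dtype='Int64')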
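A caveat worth spelling out: a NumPy-backed int64 column cannot hold NaN, so assigning np.nan to a new column always produces float64. The following is a minimal sketch, assuming pandas 1.0 or later, that contrasts this with the nullable 'Int64' extension dtype. The sample frame mirrors the one in the question, and the column name as_float is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [5, 7, 6], "col2": ["s", "g", "f"]})

# Assigning NaN to a new column yields float64, not int64.
df["as_float"] = np.nan

# The nullable integer dtype "Int64" (capital I) can hold missing values.
df["new_col"] = pd.Series(pd.NA, index=df.index, dtype="Int64")

print(df.dtypes)
```

Integers assigned to new_col later should keep the column as Int64 rather than turning it into floats.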

Related

Throw an exception and move on in pandas

I have created a pandas dataframe called df with the following code:
import numpy as np
import pandas as pd
ds = {'col1' : ["1","2","3","A"], "col2": [45,6,7,87], "col3" : ["23","4","5","6"]}
df = pd.DataFrame(ds)
The dataframe looks like this:
print(df)
col1 col2 col3
0 1 45 23
1 2 6 4
2 3 7 5
3 A 87 6
Now, col1 and col3 are objects:
print(df.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col1 4 non-null object
1 col2 4 non-null int64
2 col3 4 non-null object
I want to transform, where possible, the object columns into floats.
For example, I can convert col3 into a float like this:
df['col3'] = df['col3'].astype(float)
But I cannot convert col1 into a float:
df['col1'] = df['col1'].astype(float)
ValueError: could not convert string to float: 'A'
Is it possible to write code that converts the object columns into floats where possible and skips the cases where it is not possible (that is, without throwing an error that stops the process)? I guess it has to do with exceptions?

I think you can test whether a column holds strings (object dtype), in which case the conversion is skipped. Did you try this?
for y in df.columns:
    if df[y].dtype == object:
        continue
    else:
        pass  # your treatment here
Alternatively, pandas (0.20.2 at least) has a function that performs this test: pandas.api.types.is_string_dtype(df['col1'])
This applies when all the values of a column are of the same type; if the values are mixed, iterate over df.values.
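As a rough illustration of the dtype test mentioned above, here is a small sketch that uses is_string_dtype from pandas.api.types to pick out the string columns of the question's sample frame. The variable name string_cols is only illustrative.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

df = pd.DataFrame({
    "col1": ["1", "2", "3", "A"],
    "col2": [45, 6, 7, 87],
    "col3": ["23", "4", "5", "6"],
})

# Collect the names of the columns whose values are strings.
string_cols = [name for name in df.columns if is_string_dtype(df[name])]
print(string_cols)
```

The test only says that a column holds strings; whether each string can be parsed as a number still has to be checked separately.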

I have sorted it.
def convert_float(x):
    # return the column as float if every value converts, otherwise unchanged
    try:
        return x.astype(float)
    except (ValueError, TypeError):
        return x
for col in df.columns:
    df[col] = convert_float(df[col])
print(df)
df.info()
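A different technique, not the one used in the answer above, is to let pd.to_numeric coerce unparseable values to NaN instead of leaving the whole column untouched. This sketch assumes that losing the original 'A' value is acceptable, which may not hold for the asker's data.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["1", "2", "3", "A"],
    "col2": [45, 6, 7, 87],
    "col3": ["23", "4", "5", "6"],
})

# errors="coerce" turns values that cannot be parsed into NaN
# instead of raising, so every column becomes numeric.
converted = df.apply(pd.to_numeric, errors="coerce")
print(converted)
```

The trade-off is that col1 becomes numeric with one missing value, whereas the try/except version keeps col1 as strings.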

Dataframe columns cleaning

I am trying to clean a number of columns in a dataset by iterating over the columns.
import pandas as pd
df = pd.DataFrame({
'A': [7.3\N\P,nan\T\Z,11.0\R\Z],
'B': [nan\J\N, nan\A\G, 10.8\F\U],
'C': [12.4\A\I, 13.3\H\Z, 8.200000000000001\B\W]})
for name, values in df.iloc[:, 0:3].iteritems():
    def myreplace(s):
        for char in ['\A','\B','\C','\D','\E','\F','\G','\H','\I',
                     '\J','\K','\L','\M','\\N','\O','\P','\Q','\R',
                     '\S','\T','\V','\W','\X','\Y','\Z','\\U']:
            s = s.map(lambda x: x.replace(char, ''))
        return s
    df = df.apply(myreplace)
I get the error: 'float' object has no attribute 'replace'
I could run the following on a single column and it works, but I need to run it on several columns, and there it fails with the error that 'DataFrame' object has no attribute 'str':
df_data.str.replace('[\\\|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z]', '')
I am really new to Python and pandas dataframes. I would appreciate the help.

Given the following frame, and assuming the goal is to extract the numbers from the strings:
A B C
0 7.3\N\P nan\J\N 12.4\A\I
1 nan\T\Z nan\A\G 13.3\H\Z
2 11.0\R\Z 10.8\F\U 8.200000000000001\B\W
Doing:
cols = ['A', 'B', 'C']
for col in cols:
    # raw string keeps the regex backslashes intact; expand=False returns a Series
    df[col] = df[col].str.extract(r'(\d*\.\d*)', expand=False).astype(float)
Output:
A B C
0 7.3 NaN 12.4
1 NaN NaN 13.3
2 11.0 10.8 8.2
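To try the extraction end to end, the frame first has to be built with valid string literals, since the snippet in the question is not valid Python as written. The sketch below assumes that raw strings reproduce the intended values.

```python
import pandas as pd

# Raw strings keep the backslashes literal (assumed to match the real data).
df = pd.DataFrame({
    "A": [r"7.3\N\P", r"nan\T\Z", r"11.0\R\Z"],
    "B": [r"nan\J\N", r"nan\A\G", r"10.8\F\U"],
    "C": [r"12.4\A\I", r"13.3\H\Z", r"8.200000000000001\B\W"],
})

for col in ["A", "B", "C"]:
    # Rows without a decimal number do not match and become NaN.
    df[col] = df[col].str.extract(r"(\d*\.\d*)", expand=False).astype(float)

print(df)
```

Note that the pattern requires a decimal point, so a value such as "11" without one would not match and would come out as NaN.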

append one CSV to another as a dataframe based on certain column names without headers in pandas

I have a CSV in a data frame with these columns and data
ID. Col1. Col2. Col3 Col4
I have another CSV with just
ID. Column2. Column3
How can I append the second CSV's data to the first under the corresponding headers, without including the second CSV's header row?
My Expected Dataframe
ID. Col1. Col2. Col3 Col4
Data.CSV1 Data.CSV1 Data.CSV1 Data.CSV1 Data.CSV1
ID.DataCSV2. Column2.DataCSV2. Column3.DataCSV2
Note that the column names in CSV 2 are different.

IIUC, you'll need to clean your column names; then you can do a simple concat.
import re
import pandas as pd
def col_cleaner(cols):
    # strip whitespace and dots from every column name
    new_cols = [re.sub(r'\s+|\.', '', x) for x in cols]
    return new_cols
df1.columns = col_cleaner(df1.columns)
df2.columns = col_cleaner(df2.columns)
#output
#['ID', 'Val1', 'Val2', 'Val3', 'Val4']
#['ID', 'Val2', 'Val3']
new_df = pd.concat([df1,df2],axis=0)
new_df.to_csv('your_csv.csv')
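Cleaning alone only helps when the names differ by dots or whitespace. If the second file really uses Column2 and Column3 where the first uses Col2 and Col3, the names would also need to be mapped onto each other. This is a sketch under that assumption, with made-up rows standing in for the two CSV files.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files.
df1 = pd.DataFrame({
    "ID": [1, 2],
    "Col1": ["a", "b"],
    "Col2": ["c", "d"],
    "Col3": ["e", "f"],
    "Col4": ["g", "h"],
})
df2 = pd.DataFrame({"ID": [3], "Column2": ["x"], "Column3": ["y"]})

# Map the second file's header names onto the first file's names.
df2 = df2.rename(columns={"Column2": "Col2", "Column3": "Col3"})

# Rows from df2 land under the matching headers; columns it lacks become NaN.
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```

Passing ignore_index=True renumbers the rows so that the index does not repeat after the concat.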

I think you can use pd.concat (DataFrame.append, the older way to do this, was removed in pandas 2.0):
pd.concat([df1, df2])
col1 col2 col3
0 1 2 2.0
1 2 3 3.0
2 3 4 4.0
0 3 2 NaN
1 4 3 NaN
2 5 4 NaN
Sample Data
df1 = pd.DataFrame({'col1': [1,2,3], 'col2':[2,3,4], 'col3':[2,3,4]})
df2 = pd.DataFrame({'col1': [3,4,5], 'col2':[2,3,4]})

Instead of appending value as a new column on the same row, pandas adds a new column AND new row

What I have below is an example of the type of concatenation that I am trying to do.
df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)), columns = ['col1', 'col2', 'col3'], index = ['a'])
df2 = pd.DataFrame() # already exists elsewhere in code
df2 = df2.append([df, pd.Series(1, name = 'label')])
The result I am hoping for is:
col1 col2 col3 label
a 1.0 2.0 3.0 1
but what I get is
col1 col2 col3 0
a 1.0 2.0 3.0 NaN
0 NaN NaN NaN 1.0
I know that I'm joining these wrong, but I cannot seem to figure out how it's done. Any advice?

This happens because the Series you are appending has an incompatible index. The original dataframe has ['a'] as its index, while the Series has no index specified and so gets the default label 0. If you want to add a new column without specifying an index, the following will give you what you want:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)), columns = ['col1', 'col2', 'col3'], index = ['a'])
df2 = pd.DataFrame() # already exists elsewhere in code
df2 = pd.concat([df2, df]) # append the desired dataframe (DataFrame.append was removed in pandas 2.0)
df2['label'] = 1 # add a new column with the value 1 across all rows
print(df2.to_string())
col1 col2 col3 label
a 1 2 3 1
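If the label has to stay a Series, the index mismatch described above can also be resolved by giving the Series the same index as the frame and concatenating along the columns. A minimal sketch, assuming one label per existing row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([1, 2, 3]).reshape((1, 3)),
    columns=["col1", "col2", "col3"],
    index=["a"],
)

# Give the Series the same index label as the frame so the values align.
label = pd.Series(1, index=df.index, name="label")

# axis=1 places the Series next to the existing columns instead of below them.
result = pd.concat([df, label], axis=1)
print(result)
```

The Series name becomes the new column's header, which is why name="label" is kept.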

Pandas: Read CSV with multiple headers

I have the following (caret) delimited csv (the file needs to be in this format):
HEADER^20181130
[Col1]^[Col2]^[Col3]^[Col4]^[Col5]
The^quick^"bro,wn"^fox^jumped
over^the^fat^lazy^dog
m1213^4,12r4^fr,34^,56,gt^12fr,12fr
Trailer^N
and I need to read the file while preserving the order of the headers so that the output matches the following:
However, when I try:
df = pd.read_csv(source_file, header=[0,1], sep=r"[| ^]", engine='python')
I get:
and if I try:
df = pd.read_csv(source_file, header=[1], sep=r"[| ^]",engine='python')
I just get:
Any way to import this file with both headers? Bonus points if we can remove the opening and closing brackets for the header without removing them elsewhere in the file.
Note: I have sep=r"[| ^]" because the file could be delimited with pipes as well.

To keep both header rows, I would suggest creating a pd.MultiIndex from the first two rows of your data.
To do that, you will need to import your data without a header.
import numpy as np
import pandas as pd
df = pd.read_csv('~/Desktop/stackoverflow_data.csv', sep=r"[| ^]", header=None, engine='python')
df.reset_index(inplace=True)
df.fillna(np.nan, inplace=True)
df.head()
Output:
level_0 level_1 level_2 0 1
0 HEADER 20181130 NaN NaN NaN
1 [Col1] [Col2] [Col3] [Col4] [Col5]
2 The quick "bro,wn" fox jumped
3 over the fat lazy dog
4 m1213 4,12r4 fr,34 ,56,gt 12fr,12fr
Then you will need to zip the first two rows into tuples (removing the square brackets along the way) and create a MultiIndex object:
cols = tuple(zip(df.iloc[0], df.iloc[1].apply(lambda x: x[1:-1])))
header = pd.MultiIndex.from_tuples(cols, names=['Lvl_1', 'Lvl_2'])
# delete the header rows and assign new header
df.drop([0,1], inplace=True)
df.columns = header
df.head()
This is the output:
Lvl_1 HEADER 20181130 NaN
Lvl_2 Col1 Col2 Col3 Col4 Col5
2 The quick "bro,wn" fox jumped
3 over the fat lazy dog
4 m1213 4,12r4 fr,34 ,56,gt 12fr,12fr
5 Trailer N NaN NaN NaN
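A variant of the same idea, sketched here with the sample rows held in memory, splits off the first and last lines before parsing so that the trailer row does not end up in the data. Two things are my assumptions rather than part of the answer: the separator pattern omits the space from the question's pattern so that values containing spaces are not split, and the short first line is padded with empty strings to match the number of columns.

```python
import io
import re

import pandas as pd

raw = "\n".join([
    "HEADER^20181130",
    "[Col1]^[Col2]^[Col3]^[Col4]^[Col5]",
    'The^quick^"bro,wn"^fox^jumped',
    "over^the^fat^lazy^dog",
    "m1213^4,12r4^fr,34^,56,gt^12fr,12fr",
    "Trailer^N",
])
lines = raw.splitlines()
sep = r"[|^]"  # caret or pipe

# Parse only the bracketed header line and the data rows.
body = "\n".join(lines[1:-1])
df = pd.read_csv(io.StringIO(body), sep=sep, engine="python")

# Pad the short first line so it has one entry per column.
top = re.split(sep, lines[0])
top += [""] * (len(df.columns) - len(top))

# Strip the square brackets from the header names only.
bottom = [name.strip("[]") for name in df.columns]
df.columns = pd.MultiIndex.from_arrays([top, bottom], names=["Lvl_1", "Lvl_2"])
print(df)
```

Because the brackets are stripped from the column names only, bracket characters elsewhere in the file are left alone.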
