Append one CSV to another as a dataframe based on certain column names, without headers, in pandas

I have a CSV loaded into a data frame with these columns and data:
ID. Col1. Col2. Col3 Col4
I have another CSV with just:
ID. Column2. Column3
How can I append the 2nd CSV's data to the 1st under the corresponding headers, without including CSV2's header row?
My expected dataframe:
ID.          Col1.        Col2.             Col3              Col4
Data.CSV1    Data.CSV1    Data.CSV1         Data.CSV1         Data.CSV1
ID.DataCSV2               Column2.DataCSV2  Column3.DataCSV2
given that the column names in CSV 2 are different.

IIUC,
you'll need to clean your column names so the two frames agree (strip whitespace and dots, and normalise 'Column' to 'Col'); then you can do a simple concat.

import re

def col_cleaner(cols):
    # strip whitespace and dots, and normalise 'Column' to 'Col'
    # so the headers of both frames line up
    return [re.sub(r'\s+|\.', '', x).replace('Column', 'Col') for x in cols]

df1.columns = col_cleaner(df1.columns)
df2.columns = col_cleaner(df2.columns)
# cleaned names:
# ['ID', 'Col1', 'Col2', 'Col3', 'Col4']
# ['ID', 'Col2', 'Col3']

new_df = pd.concat([df1, df2], axis=0)
new_df.to_csv('your_csv.csv', index=False)
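Note that pd.concat aligns on the (now matching) column names, so the rows coming from df2 will have NaN under Col1 and Col4, which is exactly the layout the question asks for. Pass ignore_index=True to concat if you also want a fresh 0..n-1 index in the result.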

I think you can use .append (note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a pd.concat equivalent is sketched after this answer).
Sample data:
df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4], 'col3': [2, 3, 4]})
df2 = pd.DataFrame({'col1': [3, 4, 5], 'col2': [2, 3, 4]})

df1.append(df2)
   col1  col2  col3
0     1     2   2.0
1     2     3   3.0
2     3     4   4.0
0     3     2   NaN
1     4     3   NaN
2     5     4   NaN
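On pandas 2.0 and later, where .append is gone, the equivalent call with the same frames is:

pd.concat([df1, df2])                     # same result, repeated index and all
pd.concat([df1, df2], ignore_index=True)  # if you prefer a fresh 0..n-1 index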

Related

Throw an exception and move on in pandas

I have created a pandas dataframe called df with the following code:
import numpy as np
import pandas as pd
ds = {'col1' : ["1","2","3","A"], "col2": [45,6,7,87], "col3" : ["23","4","5","6"]}
df = pd.DataFrame(ds)
The dataframe looks like this:
print(df)
col1 col2 col3
0 1 45 23
1 2 6 4
2 3 7 5
3 A 87 6
Now, col1 and col3 are objects:
print(df.info())
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   col1    4 non-null      object
 1   col2    4 non-null      int64
 2   col3    4 non-null      object
I want to transform, where possible, the object columns into floats.
For example, I can convert col3 into a float like this:
df['col3'] = df['col3'].astype(float)
But I cannot convert col1 into a float:
df['col1'] = df['col1'].astype(float)
ValueError: could not convert string to float: 'A'
Is it possible to write code that converts, where possible, the object columns into floats and bypasses the cases in which it is not possible (so, without throwing an error which stops the process)? I guess it has to do with exceptions?
You can test whether a column holds strings (i.e. has object dtype); those are the only columns that need converting, and the conversion must be guarded because it can fail. Did you try this?
for y in df.columns:
    if df[y].dtype != object:
        continue  # numeric columns need no conversion
    # attempt the float conversion here
Or, apparently since pandas 0.20.2, there is a function which makes the test: pandas.api.types.is_string_dtype(df['col1']).
This covers the case where all the values of a column have the same type; if the values are mixed, iterate over the values themselves.
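A minimal sketch combining that test with a guarded cast (the helper name convert_where_possible is mine):

from pandas.api.types import is_string_dtype

def convert_where_possible(df):
    out = df.copy()
    for col in out.columns:
        if is_string_dtype(out[col]):
            try:
                out[col] = out[col].astype(float)
            except (ValueError, TypeError):
                pass  # leave the column untouched when the cast fails
    return out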
I have sorted it:

def convert_float(x):
    try:
        return x.astype(float)
    except (ValueError, TypeError):
        return x  # the column cannot be converted; return it unchanged

for col in df.columns:
    df[col] = convert_float(df[col])

print(df)
print(df.info())

How to union two dataframes which have different numbers of columns in PySpark

I have two dataframes:
df1, which consists of columns col1 to col7
df2, which consists of columns col1 to col9
I need to perform a union of these two dataframes, but it fails because of the two extra columns.
Any idea what other function can be used?
Add the two missing columns to df1 and then go ahead with the union.
Import:
from pyspark.sql.functions import lit
If col8 and col9 are numbers, then do:
new_df = df1.withColumn("col8", lit(float('nan'))).withColumn("col9", lit(float('nan')))
Or if col8 and col9 are strings, then do:
new_df = df1.withColumn("col8", lit("")).withColumn("col9", lit(""))
Now union new_df with df2. Keep in mind that union matches columns by position, not by name, so make sure new_df's column order matches df2's (e.g. new_df.select(df2.columns)) first.
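Alternatively, on Spark 3.1+ there is unionByName with allowMissingColumns, which fills the missing columns with nulls for you, as a one-step alternative to the manual approach above:

result = df1.unionByName(df2, allowMissingColumns=True)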

Keeping only one column after join

For the following code
d = {'col1': [33,34], 'col2': [1,2]}
d1 = {'col3': [33,34], 'col4': [3,4]}
df = pd.DataFrame(data=d)
df1 = pd.DataFrame(data=d1)
myDF=pd.merge(df, df1, how='inner', left_on=['col1'], right_on=['col3'])
In myDF, both key columns (col1 and col3) are kept. Is there a way to keep only one of them (col1 or col3) with the merge function? (Of course I can drop a column after the merge; I just want to see whether the step can be simplified.)
Rename the column so the output has only one key column; then the left_on and right_on parameters are unnecessary, because on is enough:
myDF=pd.merge(df.rename(columns={'col1':'col3'}), df1, on='col3')
print (myDF)
col3 col2 col4
0 33 1 3
1 34 2 4
myDF=pd.merge(df, df1.rename(columns={'col3':'col1'}), on='col1')
print (myDF)
col1 col2 col4
0 33 1 3
1 34 2 4
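If you'd rather keep the original column names on both sides, the post-merge drop the asker mentioned also chains into one line:

myDF = pd.merge(df, df1, how='inner', left_on='col1', right_on='col3').drop(columns='col3')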

Instead of appending value as a new column on the same row, pandas adds a new column AND new row

What I have below is an example of the type of concatenation that I am trying to do.
df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)), columns = ['col1', 'col2', 'col3'], index = ['a'])
df2 = pd.DataFrame() # already exists elsewhere in code
df2 = df2.append([df, pd.Series(1, name = 'label')])
The result I am hoping for is:
col1 col2 col3 label
a 1.0 2.0 3.0 1
but what I get is:
col1 col2 col3 0
a 1.0 2.0 3.0 NaN
0 NaN NaN NaN 1.0
I know that I'm joining these wrong, but I cannot seem to figure out how it's done. Any advice?
This is because the series you are adding has an incompatible index. The original dataframe has ['a'] as the specified index and there is no index specified in the series. If you want to add a new column without specifying an index, the following will give you what you want:
df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)), columns = ['col1', 'col2', 'col3'], index = ['a'])
df2 = pd.DataFrame() # already exists elsewhere in code
df2 = df2.append([df]) # append the desired dataframe
df2['label'] = 1 # add a new column with the value 1 across all rows
print(df2.to_string())
col1 col2 col3 label
a 1 2 3 1
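On pandas 2.0+, where .append has been removed, the same result can be had with assign and pd.concat (a sketch using the question's names):

df2 = pd.concat([df2, df.assign(label=1)])  # assign adds the column using df's own index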

Pandas concat resulting in NaN rows? [duplicate]

I have two data frames with the same number of rows (1434), and I'd like to concatenate them along axis 1:
res = pd.concat([df_resolved, df1], axis=1)
The two data frames do not have any columns that have the same name. I'd just like to join them like:
df1: df2:
col1 col2 | col3 col4
1 0 | 9 0
6 0 | 0 0
=
concatenated_df:
col1 col2 col3 col4
1 0 9 0
6 0 0 0
This works fine on a small example like this, but for some reason I end up with many NaN rows when I try it on my original dataset, which is too big for me to inspect (I'm trying to join data frames of shapes 1434x24 and 1434x17458). The outcome looks something like:
concatenated_df:
col1 col2 col3 col4
1    0    9    0
6    0    0    0
NaN  NaN  0    0
But I don't see why. Do you have any ideas how this can occur? I've tried renaming all the columns in the smaller data frame by appending a _xyz string to the column names, but the issue stays the same.
The answer to a similar question here might help: pandas concat generates nan values
Briefly, if the row indices for the two dataframes have any mismatches, the concatenated dataframe will have NaNs in the mismatched rows. If you don't need to keep the indices the way they are, using df.reset_index(drop=True, inplace=True) on both datasets before concatenating should fix the problem.
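A minimal sketch of that fix, using the question's variable names:

res = pd.concat(
    [df_resolved.reset_index(drop=True), df1.reset_index(drop=True)],
    axis=1,
)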
I used to have the same problem when generating training and testing sets. This workaround sidesteps the index entirely; note that it assumes the rows of the two frames already line up positionally:
l1 = df.values.tolist()
l2 = df_resolved.values.tolist()
for i in range(len(l1)):
    l1[i].extend(l2[i])  # glue each row of df_resolved onto the matching row of df
df = pd.DataFrame(l1, columns=df.columns.tolist() + df_resolved.columns.tolist())
(As the answer above explains, pd.concat does not "work" here because it aligns on the index rather than on row position.)