I want to concat multiple pandas DataFrames using a function. For example, see the following.
import pandas as pd
import numpy as np
df =pd.DataFrame({'A':['Apple','Yahoo','Google']})
df2 =pd.DataFrame({'A':['Microsoft', 'Apple', 'Google']})
nan_value = 0
combined = pd.concat([df, df2], join='outer').fillna(nan_value)
But when I put the same code into a function, as below, it raises an error: TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
def combine_dataframes(df):
    nan_value = 0
    combined = pd.concat(df, join='outer').fillna(nan_value)
    return combined
dfs = [df, df2]
combined = [combine_dataframes(i) for i in dfs]
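One likely reading of the error: the list comprehension calls the function once per DataFrame, so pd.concat receives a single DataFrame instead of a list. A minimal sketch of a fix, assuming the goal is one combined frame (the parameter name frames and the nan_value default are my own choices, not from the question):

```python
import pandas as pd

def combine_dataframes(frames, nan_value=0):
    """Concatenate an iterable of DataFrames and fill missing values."""
    # pd.concat expects an iterable of pandas objects, not a single DataFrame
    return pd.concat(list(frames), join="outer").fillna(nan_value)

df = pd.DataFrame({"A": ["Apple", "Yahoo", "Google"]})
df2 = pd.DataFrame({"A": ["Microsoft", "Apple", "Google"]})

# Pass the whole list in one call instead of looping over it
combined = combine_dataframes([df, df2])
print(combined)
```

Passing ignore_index=True to pd.concat would also renumber the rows, if the duplicated index values are not wanted.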
Related
My dtype changes after I uncomment foo and the groupby, and I get: # we require a list, but not a 'str'.
What I want: when the value in the first column (in my case Date) is the same, the text from the third column should be joined onto it, separated by ',', in my final project.
import os
import pandas as pd
import dateutil
from pandas import DataFrame
from datetime import datetime, timedelta
data_file_folder = r'.\Data'  # raw string so the backslash is taken literally
df = []
for file in os.listdir(data_file_folder):
    if file.endswith('.xlsx'):
        print('Loading File {0}...'.format(file))
        df.append(pd.read_excel(os.path.join(data_file_folder,file),sheet_name='Sheet1'))
df_master = pd.concat(df,axis=0)
df_master['Date'] = df_master['Date'].dt.date
#foo = lambda a: ", ".join(a)
#df_master = df_master.groupby(by='Date').agg({'Tweet': foo}).reset_index()
#df_master.to_excel(r'.\NewFolder\example.xlsx',index=False)
#df_master
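A minimal sketch of the grouping step on made-up data, assuming the columns are called Date, User and Tweet (User and the sample rows are hypothetical). It is not a diagnosis of the error above; it only shows one way to join the text per date. Casting to str first keeps ", ".join from failing on non-string cells such as NaN:

```python
import pandas as pd

# Hypothetical sample standing in for the concatenated Excel sheets
df_master = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02"]),
    "User": ["a", "b", "c"],
    "Tweet": ["first", "second", "third"],
})
df_master["Date"] = df_master["Date"].dt.date

# Join all tweets that share a date into one comma-separated string
joined = (
    df_master.groupby("Date")["Tweet"]
    .agg(lambda s: ", ".join(s.astype(str)))
    .reset_index()
)
print(joined)
```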
I have a PySpark DataFrame df with columns 'latitude' and 'longitude':
+---------+---------+
| latitude|longitude|
+---------+---------+
|51.822872| 4.905615|
|51.819645| 4.961687|
| 51.81964| 4.961713|
| 51.82256| 4.911187|
|51.819263| 4.904488|
+---------+---------+
I want to get the UTM coordinates ('x' and 'y') from the dataframe columns. To do this, I need to feed the 'longitude' and 'latitude' values to the following function from pyproj. The resulting 'x' and 'y' should then be appended to the original dataframe df. This is how I did it in pandas:
from pyproj import Proj
pp = Proj(proj='utm',zone=31,ellps='WGS84', preserve_units=False)
xx, yy = pp(df["longitude"].values, df["latitude"].values)
df["X"] = xx
df["Y"] = yy
How would I do this in PySpark?
Use pandas_udf: pass the function an array column and return an array as well. See below:
from pyspark.sql.functions import array, pandas_udf, PandasUDFType
from pyproj import Proj
from pandas import Series
@pandas_udf('array<double>', PandasUDFType.SCALAR)
def get_utm(x):
    # x is a pandas Series whose elements are [longitude, latitude] arrays
    pp = Proj(proj='utm',zone=31,ellps='WGS84', preserve_units=False)
    return Series([ pp(e[0], e[1]) for e in x ])
df.withColumn('utm', get_utm(array('longitude','latitude'))) \
    .selectExpr("*", "utm[0] as X", "utm[1] as Y") \
    .show()
I am using rpy2 to get the comorbidity index of patients. I get the results, but I am not able to convert the output to a pandas DataFrame.
The code is below.
# creating the DataFrame
import pandas as pd
data = {"person_id":[1,1,1,2,2,3],
        "dx_1":["F11","E40","","F32","C77","G10"],
        "dx_2":["F1P","E400","","F322","C737",""]}
df1 = pd.DataFrame(data)
# converting the pandas DataFrame to an R data frame using rpy2
import rpy2
from rpy2.robjects import pandas2ri
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
r_dataframe = pandas2ri.py2ri(df1)
print(r_dataframe)
# importing the 'comorbidity' package using rpy2
R = rpy2.robjects.r
DTW = importr('comorbidity')
# executing the comorbidity function on one column, icd_1
output = DTW.comorbidity(x = r_dataframe, id = "person_id", code = "icd_1",
                         score = "charlson", assign0 = False,
                         icd = "icd10")
print(output)
However, I am not able to convert the output to a pandas DataFrame:
import rpy2, rpy2.robjects as robjects, rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector
#Converting data frames back and forth between rpy2 and pandas
from rpy2.robjects import r, pandas2ri
#convert output to pandas dataframe
pandas2ri.ri2py_dataframe(output)
I get the error below:
TypeError: Parameter 'categories' must be list-like, was
Please help.
Thanks in advance.
import pandas as pd
data = {'x':['011','012','013'],'y':['022','033','041']}
Df = pd.DataFrame(data = data,dtype = str)
Df.to_csv("path/to/save.csv")
The result I obtained looks like this:
To achieve that result, it is easier to export directly to an xlsx file, even without setting the dtype of the DataFrame.
import pandas as pd
data = {'x':['011','012','013'],'y':['022','033','041']}
Df = pd.DataFrame(data = data)
# the context manager saves and closes the file on exit
with pd.ExcelWriter('path/to/save.xlsx') as writer:
    Df.to_excel(writer, sheet_name="Sheet1")
I also tried some other methods, like prepending an apostrophe or quoting all fields with ", but they had no effect.
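Assuming the concern is the leading zeros in values like '011' (my reading of the example data; the post does not say so explicitly), a quick check suggests the CSV text itself keeps them, and they are only lost when a reader parses the column as numbers. A small sketch using an in-memory buffer in place of the real file path:

```python
import io
import pandas as pd

data = {"x": ["011", "012", "013"], "y": ["022", "033", "041"]}
Df = pd.DataFrame(data=data, dtype=str)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
Df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)  # the zeros are still present in the raw text

# Default parsing infers integers and drops the leading zeros
buffer.seek(0)
as_numbers = pd.read_csv(buffer)

# Forcing str on read keeps the values exactly as written
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype=str)
print(as_numbers["x"].tolist(), as_text["x"].tolist())
```

If that assumption holds, the zeros are dropped by whatever opens the CSV, not by to_csv.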
Does anyone know how one goes about enabling the REFS_OK flag in numpy? I cannot seem to find a clear explanation online.
My code is:
import sys
import string
import numpy as np
import pandas as pd
SNP_df = pd.read_csv('SNPs.txt',sep='\t',index_col = None ,header = None,nrows = 101)
output = open('100 SNPs.fa','a')
for i in SNP_df:
    data = SNP_df[i]
    data = np.array(data)
    for j in np.nditer(data):
        if j == 0:
            output.write(("\n>%s\n")%(str(data(j))))
        else:
            output.write(data(j))
I keep getting the error message: Iterator operand or requested dtype holds references, but the REFS_OK was not enabled.
I cannot work out how to enable the REFS_OK flag so the program can continue...
I have isolated the problem. There is no need to use np.nditer. The main problem was that I misunderstood how Python handles the loop variable in a for loop. The corrected code is below.
import sys
import string
import fileinput
import numpy as np
import pandas as pd
SNP_df = pd.read_csv('datafile.txt',sep='\t',index_col = None ,header = None,nrows = 5000)
output = open('outputFile.fa','a')
for i in range(1,51):
    data = SNP_df[i]
    data = np.array(data)
    # the first element of each column becomes the FASTA header line
    for j in range(0,1):
        output.write(("\n>%s\n")%(str(data[j])))
    # the remaining elements are written out as the sequence
    for k in range(1,len(data)):
        output.write(str(data[k]))
output.close()
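The same idea on a tiny made-up table, as a sketch only: it assumes each column holds a name in its first row followed by the sequence characters, and writes to an in-memory buffer rather than a file. The sample values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for the tab-separated SNP file (header=None layout)
SNP_df = pd.DataFrame({
    0: ["id", "x", "y"],
    1: ["snp1", "A", "C"],
    2: ["snp2", "G", "T"],
})

output = io.StringIO()
for col in SNP_df.columns[1:]:          # skip column 0, as range(1, 51) does
    values = SNP_df[col].tolist()
    output.write("\n>%s\n" % values[0])  # header line from the first row
    output.write("".join(str(v) for v in values[1:]))  # rest is the sequence

fasta_text = output.getvalue()
print(fasta_text)
```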
If you really want to enable the flag, I have a working example.
Python 2.7, numpy 1.14.2, pandas 0.22.0
import pandas as pd
import numpy as np
# get all data as panda DataFrame
data = pd.read_csv("./monthdata.csv")
print(data)
# get values as numpy array
data_ar = data.values # numpy.ndarray, every element is a row
for row in data_ar:
    print(row)
    sum = 0
    count = 0
    # the flag name is lowercase: "refs_ok"
    for month in np.nditer(row, flags=["refs_ok"], op_flags=["readwrite"]):
        print(month)
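A self-contained sketch of the same flag without the CSV file, using a small object array (the sample values are made up). It shows that np.nditer refuses object arrays by default and accepts them once "refs_ok" is passed:

```python
import numpy as np

# An object-dtype row, like the rows pandas returns for mixed columns
row = np.array(["Jan", 31, 2.5], dtype=object)

# Without the flag, nditer raises because the array holds Python references
try:
    for item in np.nditer(row):
        pass
    raised = False
except TypeError:
    raised = True

# With flags=["refs_ok"], iteration works; .item() unwraps each 0-d array
values = [item.item() for item in np.nditer(row, flags=["refs_ok"])]
print(raised, values)
```

For plain reading, a normal for loop over the array does the same job and needs no flag.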