Making Many Empty Columns in PySpark

I have a list of many dataframes, each with a subset schema of a master schema. In order to union these dataframes, I need to construct a common schema across all of them. My thought is that I need to create empty columns for all of the missing columns in each dataframe. Each dataframe is missing about 80 features on average, and there are hundreds of dataframes.
This is somewhat of a duplicate of, or inspired by, Concatenate two PySpark dataframes.
I am currently implementing things this way:
from pyspark.sql.functions import lit

for df in dfs:  # list of dataframes
    for feature in missing_features:  # list of strings
        df = df.withColumn(feature, lit(None).cast("string"))
This seems to be taking a significant amount of time. Is there a faster way to concat these dataframes with null in place of missing features?

You might be able to cut time a little by replacing your code with:
cols = ["*"] + [lit(None).cast("string").alias(f) for f in missing_features]
dfs_new = [df.select(cols) for df in dfs]
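Not from the original answer, but a sketch of the full union flow under the assumption that the master schema's column names are known up front (master_cols and pad_to_master are illustrative names; dfs is the list from the question):
from functools import reduce
from pyspark.sql.functions import lit

master_cols = ["a", "b", "c"]  # hypothetical master schema

def pad_to_master(df):
    # Add a NULL string column for each master column the frame lacks,
    # then select in master order so every frame lines up for union().
    missing = [c for c in master_cols if c not in df.columns]
    cols = ["*"] + [lit(None).cast("string").alias(c) for c in missing]
    return df.select(cols).select(master_cols)

combined = reduce(lambda a, b: a.union(b), [pad_to_master(df) for df in dfs])
Computing the missing columns per dataframe, rather than adding one fixed list to all of them, also avoids duplicate-column errors when a frame already has one of the "missing" features.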

Related

Pandas Map on some rows only?

Is there a way to use pandas map on only some rows and ignore all others?
Example DF:
import datetime as dt
import pandas as pd

df = pd.DataFrame({'ProductID': ['Playstation', 'Playstation', 'Playstation',
                                 'Sony Playstation', 'Sony Playstation', 'Sony Playstation'],
                   'date': [dt.date(2022, 11, 1), dt.date(2022, 11, 5), dt.date(2022, 11, 1),
                            dt.date(2022, 11, 10), dt.date(2022, 11, 15), dt.date(2022, 11, 1)],
                   'customerID': ['01', '01', '01', '01', '02', '01'],
                   'brand': ['Cash', 'Cash', 'Game', 'Cash', 'Cash', 'Game'],
                   'gmv': [10, 50, 30, 40, 50, 60]})
As you can see, I have similar products that, for some reason, sometimes appear as "Playstation" and sometimes as "Sony Playstation".
How can I use a pandas map to replace "Playstation" with "Sony Playstation"?
Take into account that this is an example with only 2 brands. In my DF I have several brands, so building a dict of them all is not viable (and I might need to change many brands at once).
Can I apply map on a filtered df? I've tried applying map on a partial DF:
Gift.loc[(Gift.brand == 'PlayStation')].brand.map({'PlayStation': 'Sony PlayStation'}, na_action='ignore')
If so, how do I move the new data into the original df?
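For what it's worth, one common pattern (a sketch, not from the original thread) is to build a boolean mask and assign back through df.loc, which writes the mapped values into the original frame; the tiny frame below is just for illustration:
import pandas as pd

df = pd.DataFrame({'ProductID': ['Playstation', 'Sony Playstation', 'Playstation'],
                   'gmv': [10, 40, 30]})

# Mask of the rows to touch; all other rows are left untouched.
mask = df['ProductID'] == 'Playstation'
df.loc[mask, 'ProductID'] = df.loc[mask, 'ProductID'].map(
    {'Playstation': 'Sony Playstation'}, na_action='ignore')
print(df)  # every row now reads 'Sony Playstation'
For a plain renaming like this, df['ProductID'].replace({'Playstation': 'Sony Playstation'}) achieves the same without a mask; unlike map, replace leaves values missing from the dict unchanged instead of turning them into NaN.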

How to select a value in a dataframe with MultiIndex?

I use the Pandas library to analyze data coming from an Excel file.
I used pivot_table to get a pivot table with the information I'm interested in, and I end up with a MultiIndex DataFrame.
For "OPE-2016-0001", I would like to obtain the figures for 2017, for example. I've tried lots of things and nothing works. What is the correct method to use? Thank you.
import pandas as pd
import numpy as np
from math import *
import tkinter as tk

pd.set_option('display.expand_frame_repr', False)

df = pd.read_csv('datas.csv')

def tcd_op_dataExcercice():
    global df
    new_df = df.assign(Occurence=1)
    tcd = new_df.pivot_table(index=['Numéro opération', 'Libellé opération'],
                             columns=['Exercice'],
                             values=['Occurence'],
                             aggfunc=[np.sum],
                             margins=True,
                             fill_value=0,
                             margins_name='Total')
    print(tcd)
    print(tcd.xs('ALSTOM 8', level='Libellé opération', drop_level=False))

tcd_op_dataExcercice()
I get the following table (image).
How do I get the value framed in red?
You can use .loc to select rows by a DataFrame's Index labels. If the Index is a MultiIndex, .loc will index into the first level of the MultiIndex ('Numéro opération' in your case), though you can pass a tuple to index into both levels (e.g. if you specifically wanted ("OPE-2016-0001", "ALSTOM 8")).
It's worth noting that the columns of your pivoted data are also a MultiIndex, because you specified the aggfunc, values and columns arguments as lists rather than as individual values (i.e. without the []). Pandas creates a MultiIndex because of these lists, even though each list had only one element.
So you'll also need to pass a tuple to index into the columns to get the value for 2017:
tcd.loc["OPE-2016-0001", ('sum', 'Occurence', 2017)]
If you had instead just specified the aggfunc etc as individual strings, the columns would just be the years and you could select the values by:
tcd.loc["OPE-2016-0001", 2017]
Or if you specifically wanted the value for ALSTOM 8:
tcd.loc[("OPE-2016-0001", "ALSTOM 8"), 2017]
An alternative to indexing into a MultiIndex would be to just .reset_index() after pivoting, in which case the levels of the MultiIndex become ordinary columns in the data, and you can then select rows based on the values of those columns. E.g. (assuming you specified aggfunc etc. as strings):
tcd = tcd.reset_index()
tcd.query("`Numéro opération` == 'OPE-2016-0001'")[2017]

python pandas describe groupby output question

I use describe with groupby on a dataframe, such as:
df_stats = df.groupby(["key1", "key2"])[["x", "y"]].describe()
This produces the standard set of stats for "x" and "y" in df, by key1/key2 value combinations. These combinations are, in general, not known a priori.
If I print df_stats, it lists the various combinations of key1/key2 values as row labels, but I cannot find where these combination values are stored (they should be in the df_stats dataframe, no?).
The objective is to be able to have other dataframe data "look up" data in the df_stats rows so as to include it in calculations. Appreciate any insights in advance; pulling my hair out.
thx - j
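Not part of the original thread, but a minimal sketch of where those combinations live, assuming the intended call was groupby followed by describe (all column names here are illustrative):
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b'],
                   'key2': ['x', 'x', 'y', 'y'],
                   'x': [1.0, 2.0, 3.0, 4.0],
                   'y': [5.0, 6.0, 7.0, 8.0]})

df_stats = df.groupby(['key1', 'key2'])[['x', 'y']].describe()

# The key1/key2 combinations are stored as df_stats's row MultiIndex:
print(df_stats.index)

# reset_index() promotes them to ordinary columns, which other
# dataframes can then join against (e.g. with pd.merge):
flat = df_stats.reset_index()
print(flat.columns)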

How to concat 3 dataframes with each into sequential columns

I'm trying to understand how to concat three individual dataframes (i.e. df1, df2, df3) into a new dataframe, say df4, where each individual dataframe occupies its own columns in left-to-right order.
I've tried using concat with axis=1 to do this, but it appears that this can't be automated with a single action.
Table1_updated = pd.DataFrame(columns=['3P','2PG-3Io','3Io'])
Table1_updated=pd.concat([get_table1_3P,get_table1_2P_max_3Io,get_table1_3Io])
Note that, with the exception of get_table1_2P_max_3Io, which has two columns, all the other dataframes have one column.
For example, the contents of get_table1_3P, get_table1_2P_max_3Io, and get_table1_3Io, along with the desired final layout, were shown as images in the original post (not reproduced here).
I believe you need to concat first and then change the column order using a list of column names:
Table1_updated=pd.concat([get_table1_3P,get_table1_2P_max_3Io,get_table1_3Io], axis=1)
Table1_updated = Table1_updated[['3P','2PG-3Io','3Io']]
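As a rough illustration (the real frames were posted as images, so these stand-in frames and column names are hypothetical), axis=1 lays each frame's columns out side by side in list order:
import pandas as pd

get_table1_3P = pd.DataFrame({'3P': [1, 2]})
get_table1_2P_max_3Io = pd.DataFrame({'2PG-3Io': [3, 4], 'max_3Io': [5, 6]})
get_table1_3Io = pd.DataFrame({'3Io': [7, 8]})

# axis=1 aligns on the row index and appends columns left to right.
Table1_updated = pd.concat([get_table1_3P, get_table1_2P_max_3Io, get_table1_3Io], axis=1)
print(Table1_updated)
With the default axis=0, concat stacks the frames vertically instead, which is why the original attempt appeared to produce extra rows rather than extra columns.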

Merge two csv files that have a similar row structure but no common index between them

I have two csv files that I want to merge by adding the column information from one csv to the other. They have no common index between them, but they do have the same number of rows (and they are in order). I have seen many examples of joining csv files based on an index or on matching values, but my csv files share nothing except row order. I've tried a few different examples with no luck.
mycsvfile1
"a","1","mike"
"b","2","sally"
"c","3","derek"
mycsvfile2
"boy","63","retired"
"girl","55","employed"
"boy","22","student"
Desired outcome for outcsvfile3
"a","1","mike","boy","63","retired"
"b","2","sally","girl","55","employed"
"c","3","derek","boy","22","student"
Code:
import csv
import pandas as pd

df2 = pd.read_csv("mycsvfile1.csv", header=None)
df1 = pd.read_csv("mycsvfile2.csv", header=None)
df3 = pd.merge(df1, df2)
Using
df3 = pd.merge([df1,df2])
adds the data as new rows, which doesn't help me. Any assistance is greatly appreciated.
If both dataframes have numbered indexes (i.e. starting at 0 and increasing by 1 - which is the default behaviour of pd.read_csv), and assuming that both DataFrames are already sorted in the correct order so that the rows match up, then this should do it:
df3 = pd.merge(df1,df2, left_index=True, right_index=True)
You do not have any common columns between df1 and df2 besides the index, so we can use concat:
pd.concat([df1, df2], axis=1)
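Putting it together with the files from the question (a sketch; the index alignment comes from the default RangeIndex that read_csv assigns to both frames, and quoting=csv.QUOTE_ALL is only there to reproduce the fully quoted output shown above):
import csv
import pandas as pd

df1 = pd.read_csv('mycsvfile1.csv', header=None)
df2 = pd.read_csv('mycsvfile2.csv', header=None)

# Side-by-side concat on the shared 0..n-1 index.
df3 = pd.concat([df1, df2], axis=1)
df3.to_csv('outcsvfile3.csv', header=False, index=False, quoting=csv.QUOTE_ALL)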