How to encode multiple String columns in the dataset while preprocessing? - pandas

I am very new to machine learning and, as part of learning, I was going through data preprocessing.
This is what my dataset looks like:
I wanted to predict delete_ind based on the data in the first three columns.
These are the steps I followed.
schema = StructType([StructField('id', StringType(), True),
                     StructField('company_code', StringType(), True),
                     StructField('cost', FloatType(), True),
                     StructField('delete_ind', StringType(), True)])
data = spark.sql('select id, company_code, cost, delete_ind from dbname.table limit 5000')
pdata = data.toPandas()
Selecting columns from Pandas Dataframe
x = pdata.iloc[:, 0:-1].values
y = pdata.iloc[:, -1].values
Adding default values in place of missing data
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x[:, 2:3])
Encoding the first two column of the dataset to default values
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0:2])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
While trying to encode the first two columns of x, I am seeing the error below.
SyntaxError: invalid syntax
File "<command-1856943148058177>", line 1
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0:2])], remainder='passthrough')
^
SyntaxError: invalid syntax
Could anyone let me know what mistake I am making here?
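For what it's worth, the error comes from the [0:2] part: slice syntax is only valid inside indexing brackets, not inside a list literal, so Python rejects the line before anything from scikit-learn runs. A minimal sketch of one way to write it, assuming x is the array built above (a slice(0, 2) object would also work):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Pass the column positions as a plain Python list instead of slice syntax.
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0, 1])],
    remainder='passthrough')
x = np.array(ct.fit_transform(x))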

Related

How to Parse nested json column to two columns called key and value

I have a source table with 3 columns. One of the columns contains JSON values; some of the rows contain simple JSON, but some contain nested JSON like in the image's source table. I want the target table to look like the attached image. Could someone help with PySpark or SQL code to do this in Databricks?
This JSON doesn't have a fixed schema. It can vary in different ways, but ultimately it is JSON.
source and target tables
I am expecting pyspark code for above question.
Here is the sample code used to achieve this.
%py
from pyspark.sql.functions import from_json, explode_outer
from pyspark.sql.types import MapType, StringType

df1 = spark.sql("select eventId, AppId, eventdata from tableA")
df1 = df1.withColumn("EventData", from_json(df1.eventdata, MapType(StringType(), StringType())))
df1 = df1.select(df1.eventId, df1.AppId, explode_outer(df1.EventData))
display(df1)
This resulted in the output below:
[output][1]
Below is a sample JSON:
{
  "brote":"AKA",
  "qFilter":"{\"xfilters\":[{\"Molic\":\"or\",\"filters\":[{\"logic\":\"and\",\"field\":\"Name\",\"operator\":\"contains\",\"value\":\"*R-81110\"},{\"logic\":\"and\",\"field\":\"Title\",\"operator\":\"contains\",\"value\":\"*R-81110\"}]}],\"pSize\":200,\"page\":1,\"ignoreConfig\":false,\"relatedItemFilters\":[],\"entityType\":\"WAFADocuments\"}",
  "config":"[\"PR_NMO\"]",
  "title":"All Documents",
  "selected":"PR_NMO",
  "selectedCreateConfig":"PR_NMO",
  "selectedQueryConfigs":[
    "PR_CVO"
  ],
  "selectedRoles":[
    "RL_ZAC_Planner"
  ]
}
[1]: https://i.stack.imgur.com/Oftvr.png
The requirement is hard to achieve as the schema of the nested values is not fixed. To do it with the sample you have given, you can use the following code:
from pyspark.sql.functions import from_json, explode_outer, concat, lit
from pyspark.sql.types import MapType, ArrayType, StructType, StructField, StringType

df1 = df.withColumn("EventData", from_json(df.EventData, MapType(StringType(), StringType())))
df1 = df1.select(df1.eventID, df1.AppID, explode_outer(df1.EventData))
#df1.show()
df2 = df1.filter(df1.key == 'orders')
user_schema = ArrayType(
    StructType([
        StructField("id", StringType(), True),
        StructField("type", StringType(), True)
    ])
)
df3 = df2.withColumn("value", from_json("value", user_schema)).selectExpr("eventID", "AppID", "key", "inline(value)")
df3 = df3.melt(['eventID', 'AppID', 'key'], ['id', 'type'], 'sub_order', 'val')
req = df3.withColumn('key', concat(df3.key, lit('.'), df3.sub_order))
final_df = df1.filter(df1.key != 'orders').union(req.select('eventID', 'AppID', 'key', 'val'))
final_df.show()
This might not be possible if the schema is constantly changing.
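If the structure really is unknown up front, one hedged option is to let Spark infer a schema from a sample row with schema_of_json and reuse it in from_json; the column name eventdata and the dataframe df below are assumptions carried over from the question, and rows that deviate from the sampled structure will simply come back with nulls in the missing fields.
from pyspark.sql.functions import schema_of_json, from_json, lit, col

# Take one sample JSON string and let Spark derive a DDL schema from it.
sample = df.select("eventdata").first()[0]
inferred = spark.range(1).select(schema_of_json(lit(sample))).first()[0]

# Parse every row with the inferred schema.
parsed = df.withColumn("EventData", from_json(col("eventdata"), inferred))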

Join and combine dataframe by array intersection

I want to join three different tables using an array of aliases as the join condition:
Table 1:
table_1 = spark.createDataFrame([("T1", ['a','b','c']), ("T2", ['d','e','f'])], ["id", "aliases"])
Table 2:
table_2 = spark.createDataFrame([("P1", ['a','h','e']), ("P2", ['j','k','l'])], ["id", "aliases"])
Table 3:
table_3 = spark.createDataFrame([("G1", ['a','n','o']), ("G2", ['p','q','l']), ("G3", ['c','z'])], ["id", "aliases"])
And I want to get a table like this:
Aliases                        | table1_ids | table2_id | table3_id
[n, b, h, o, a, e, d, c, f, z] | [T1, T2]   | [P1]      | [G1, G3]
[k, q, j, p, l]                | []         | [P2]      | [G2]
Here, all related aliases are in the same row and no ID from the three initial tables is repeated. In other words, I am trying to group by common aliases and collect all the different IDs in which those aliases can be found.
I have used Spark SQL for the code examples, but feel free to use PySpark or pandas.
Thanks in advance.
Well, I have thought about it and I think that what I described is a graph problem. More precisely, I was trying to find all connected components, so 'aliases' + 'ids' become the graph vertices; once all the components have been found (all the subgraphs that are not connected to any other subgraph), the IDs are extracted from the results. It's very important to be able to differentiate the IDs from the values (aliases).
To implement a solution, I have used Graphframes (Thanks to this comment):
from graphframes import *
import pyspark.sql.functions as f
df = table_1.unionAll(table_2).unionAll(table_3)
edgesDF = df.select(f.explode("Aliases").alias("src"),f.col("id").alias("dst")) # Columns should be called 'src' and 'dst'.
verticesDF = edgesDF.select('src').union(edgesDF.select('dst'))
verticesDF = verticesDF.withColumnRenamed('src', 'id')
graph=GraphFrame(verticesDF,edgesDF)
components_df = graph.connectedComponents(algorithm="graphx")
components_grouped_df = components_df.groupBy("component").agg(f.collect_set("id").alias("aliases"))
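One note in passing: GraphFrames' default connected-components algorithm requires a Spark checkpoint directory to be set beforehand. The algorithm="graphx" argument used above sidesteps that, but if you drop it, something like the following is needed before the call (the path is only an example):
# Required by the default GraphFrames algorithm; the path is an example.
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")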
So now we will have a Dataframe like this:
And as we want each ID in a different column, we have to extract them from the 'aliases' column and create three new columns. To do so, we will use a regex and a UDF:
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

df_schema = StructType([
    StructField("table_1_ids", ArrayType(StringType()), True),
    StructField("table_2_ids", ArrayType(StringType()), True),
    StructField("table_3_ids", ArrayType(StringType()), True),
    StructField("aliases", ArrayType(StringType()), True)
])

@udf(returnType=df_schema)
def get_correct_results(aliases):
    regex_table_1 = r"(T\d)"
    regex_table_2 = r"(P\d)"
    regex_table_3 = r"(G\d)"
    table_1_ids = []
    table_2_ids = []
    table_3_ids = []
    elems_to_remove = []
    for elem in aliases:
        result_table_1 = re.search(regex_table_1, elem)
        result_table_2 = re.search(regex_table_2, elem)
        result_table_3 = re.search(regex_table_3, elem)
        if result_table_1:
            elems_to_remove.append(elem)
            table_1_ids.append(result_table_1.group(1))
        elif result_table_2:
            elems_to_remove.append(elem)
            table_2_ids.append(result_table_2.group(1))
        elif result_table_3:
            elems_to_remove.append(elem)
            table_3_ids.append(result_table_3.group(1))
    return {'table_1_ids': list(set(table_1_ids)),
            'table_2_ids': list(set(table_2_ids)),
            'table_3_ids': list(set(table_3_ids)),
            'aliases': list(set(aliases) - set(elems_to_remove))}
So, finally, we apply the previous UDF to build the final dataframe:
master_df = components_grouped_df.withColumn("return",get_correct_results(f.col("aliases")))\
.selectExpr("component as row_id","return.aliases as aliases","return.table_1_ids as table_1_ids","return.table_2_ids as table_2_ids", "return.table_3_ids as table_3_ids")
And the final DF will look like this:

Scaling down a high-dimensional pandas data frame using sklearn

I am trying to scale down values in a pandas data frame. The problem is that I have 291 dimensions, so scaling the values one by one is time consuming if we do it as follows:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler = scaler.fit(dataframe[['dimension_1']])
dataframe['dimension_1'] = scaler.transform(dataframe[['dimension_1']])
Problem: This is only for one dimension, so how can we do this for all 291 dimensions in one shot?
You can pass in a list of the columns that you want to scale instead of individually scaling each column.
from sklearn.preprocessing import StandardScaler

# replace the values 0 and 1 with booleans so those indicator columns are excluded from scaling
df.replace({0: False, 1: True}, inplace=True)
# make a copy of the dataframe
scaled_features = df.copy()
# take the numeric columns, i.e. those which are not of type object or bool
col_names = df.dtypes[df.dtypes != 'object'][df.dtypes != 'bool'].index.to_list()
features = scaled_features[col_names]
# use the scaler of your choice; here StandardScaler is used
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
scaled_features[col_names] = features
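For what it's worth, if you do not need the boolean conversion step, the same idea can be written more compactly by selecting the numeric columns and scaling them in place; a small sketch, assuming df is your dataframe:
from sklearn.preprocessing import StandardScaler

# Select the numeric columns only and overwrite them with their scaled values.
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = StandardScaler().fit_transform(df[num_cols])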
I normally use a Pipeline, since it can do multi-step transformations.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([('std_scale', StandardScaler())])
transformed_dataframe = num_pipeline.fit_transform(dataframe)
If you need to do more in the transformation, e.g. fill NAs, you just add another step to the list (line 3 of the code), as in the sketch below.
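For example, a minimal sketch that puts an imputation step in front of the scaler (the median strategy is just an example):
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([('fill_na', SimpleImputer(strategy='median')),  # example imputation step
                         ('std_scale', StandardScaler())])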
Note: The above pipeline code works if the datatype of all columns is numeric. If not, we need to:
1. select only the numeric columns,
2. pass them into the pipeline, then
3. put the result back into the original dataframe.
Here is the code for the 3 steps:
num_col = dataframe.dtypes[dataframe.dtypes != 'object'][dataframe.dtypes != 'bool'].index.to_list()
df_num = dataframe[num_col]                          #1 select only the numeric columns
transformed_df = num_pipeline.fit_transform(df_num)  #2 pass them into the pipeline
dataframe[num_col] = transformed_df                  #3 put the result back

To join complicated pandas tables

I'm trying to join a dataframe of results from a statsmodels GLM to a dataframe designed to hold both univariate data and model results as models are iterated through. I'm having trouble figuring out how to programmatically join the two data sets.
I've consulted the pandas documentation below with no luck:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging
This is difficult because of the output of the model compared to the final table, which holds values for each unique level of each unique variable.
See an example of what the data looks like with the code below:
import pandas as pd
df = {'variable': ['CLded_model', 'CLded_model', 'CLded_model', 'CLded_model', 'CLded_model',
                   'CLded_model', 'CLded_model', 'channel_model', 'channel_model', 'channel_model'],
      'level': [0, 100, 200, 250, 500, 750, 1000, 'DIR', 'EA', 'IA'],
      'value': [460955.7793, 955735.0532, 586308.4028, 12216916.67, 48401773.87, 1477842.472,
                14587994.92, 10493740.36, 36388470.44, 31805316.37]}
final_table = pd.DataFrame(df)
df2 = {'variable': ['intercept', 'C(channel_model)[T.EA]', 'C(channel_model)[T.IA]', 'CLded_model'],
       'coefficient': [-2.36E-14, -0.091195797, -0.244225888, 0.00174356]}
model_results = pd.DataFrame(df2)
After this is run you can see that, for categorical variables, the value is encased in a few extra layers compared to the final_table. Numerical variables such as CLded_model need to be joined with the one coefficient they are associated with.
There is a lot to this and I'm not sure where to start.
Update: The following code produces the desired result:
d3 = {'variable': ['intercept', 'CLded_model', 'CLded_model', 'CLded_model', 'CLded_model', 'CLded_model',
                   'CLded_model', 'CLded_model', 'channel_model', 'channel_model', 'channel_model'],
      'level': [None, 0, 100, 200, 250, 500, 750, 1000, 'DIR', 'EA', 'IA'],
      'value': [None, 60955.7793, 955735.0532, 586308.4028, 12216916.67, 48401773.87, 1477842.472,
                14587994.92, 10493740.36, 36388470.44, 31805316.37],
      'coefficient': [-2.36E-14, 0.00174356, 0.00174356, 0.00174356, 0.00174356, 0.00174356, 0.00174356,
                      0.00174356, None, -0.091195797, -0.244225888]}
desired_result = pd.DataFrame(d3)
First you have to clean df2:
df2['variable'] = df2['variable'].str.replace(r"C\(", "", regex=True)\
                                 .str.replace(r"\)\[T.", "-", regex=True)\
                                 .str.strip("]")
df2
           variable   coefficient
0         intercept -2.360000e-14
1  channel_model-EA -9.119580e-02
2  channel_model-IA -2.442259e-01
3       CLded_model  1.743560e-03
Because you want to merge some of df1 on the level column and others not, we need to change df1 slightly to match df2 (df1 here refers to the question's final_table):
df1.loc[df1['variable'] == 'channel_model', 'variable'] = "channel_model-"+df1.loc[df1['variable'] == 'channel_model', 'level']
df1
#snippet of what changed
            variable level         value
6        CLded_model  1000  1.458799e+07
7  channel_model-DIR   DIR  1.049374e+07
8   channel_model-EA    EA  3.638847e+07
9   channel_model-IA    IA  3.180532e+07
Then we merge them:
df4 = df1.merge(df2, how='outer', left_on=['variable'], right_on=['variable'])
And we get your result (except for the minor change in the variable name).
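If you need the original variable names back afterwards, a small optional follow-up (assuming the merged frame df4 from above) would be to strip the suffix off again:
# Optional: drop the "-EA"/"-IA" style suffix to recover the original variable names.
df4['variable'] = df4['variable'].str.split('-').str[0]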

createOrReplaceTempView does not work on empty dataframe in pyspark2.0.0

I am trying to define a SQL view on a PySpark dataframe (2.0.0) and getting errors like "Table or View Not found". What I am doing:
1. Create an empty dataframe
2. Load data from a different location into a temp dataframe
3. Append the temp dataframe to the main dataframe (the empty one)
4. Define a SQL view on the dataframe (which was empty earlier)
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType, LongType,
                               IntegerType, DateType, TimestampType)

spark = SparkSession.builder.config(conf=SparkConf()).appName("mydailyjob").getOrCreate()
sc = spark.sparkContext
schema = StructType([StructField('vdna_id', StringType(), True),
                     StructField('miq_id', LongType(), True),
                     StructField('tags', IntegerType(), True),
                     StructField('dateserial', DateType(), True),
                     StructField('date_time', TimestampType(), True),
                     StructField('survey_id', StringType(), True),
                     StructField('ip', StringType(), True)])
# use the SparkSession directly (sqlContext is not defined in the snippet above)
brandsurvey_feed = spark.createDataFrame(sc.emptyRDD(), schema)
# load brandsurvey feed data for each date in date_list
for loc in all_loc:
    # load the file from a different location
    bs_tmp = spark.read.csv(loc, schema=schema, sep='\t', header=True)
    brandsurvey_feed = brandsurvey_feed.union(bs_tmp)

brandsurvey_feed.createOrReplaceTempView("brandsurvey_feed")
print(spark.sql("select * from brandsurvey_feed").show())
Folks, I think I found the reason. If we create a SQL view on a dataframe with zero records and then access the table, we get an error "table or view does not exist". I would suggest checking that the dataframe is not empty before you define any SQL view on it, as in the sketch below.
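A minimal sketch of such a check, assuming the brandsurvey_feed dataframe from the code above:
# Only register the view if the dataframe actually has rows.
if brandsurvey_feed.head(1):
    brandsurvey_feed.createOrReplaceTempView("brandsurvey_feed")
    spark.sql("select * from brandsurvey_feed").show()
else:
    print("brandsurvey_feed is empty; skipping view creation")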