How to set the setCheckpoint in pyspark - apache-spark-sql

I don't know much about Spark. At the top of the code I have
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
spark = SparkSession.builder.appName('abc').getOrCreate()
H = spark.read.parquet('path to hdfs file')
H has about 30 million records and will be used in a loop. So I wrote
H.persist().count()
I have a list of 50 strings, L = [s1, s2, …, s50]. Each string is used to build a small data frame out of H, and these small data frames are supposed to be stacked on top of each other. I created an empty data frame Z:
schema = StructType([define the schema here])
Z = spark.createDataFrame([],schema)
Then comes the loop
for st in L:
    K = process H using st
    Z = Z.union(K)
where K has at most 20 rows. When L has only 2 or 3 elements this code works. But for length of L = 50, it never ends. I learned today that I can use checkpoints. So I created a hadoop path and right above where the loop starts I wrote:
SparkContext.setCheckpointDir(dirName='path/to/checkpoint/dir')
But I get the following error: missing 1 required positional argument: 'self'. I need to know how to fix the error and how to modify the loop to incorporate the checkpoint.

setCheckpointDir is an instance method, so call it on a SparkContext object rather than on the SparkContext class; self is then supplied automatically. Also drop the dirName= keyword, which is not needed.
Code like the following works:
from pyspark import SparkConf
from pyspark.context import SparkContext
sc = SparkContext.getOrCreate(SparkConf())
sc.setCheckpointDir('path/to/checkpoint/dir')

Related

How to construct a temporal network using python

I have data for different stations on different days:
Station_start Station_end Day Hour
A B 1 14
B C 1 10
C A 1 10
B A 2 15
A C 2 13
D E 2 12
E B 2 14
F C 3 12
I want to construct a dynamic/interactive network, where the network connections change according to day.
I found an example of it in the tutorial of pathpy.
But how do I load a pandas dataframe into it, with Station_start and Station_end as the nodes?
Here is a way to do what you want. First, load your data into a pandas dataframe using pd.read_fwf (I saved your data in a file called data_net.txt).
Then incrementally add edges to your temporal network using the add_edge method of the TemporalNetwork object. Evaluate t in a notebook cell to see the animation.
See code below for more details:
import pandas as pd
import pathpy as pp
df = pd.read_fwf('data_net.txt')
t = pp.TemporalNetwork()
# Add one time-stamped edge per row, using Day as the timestamp
for start, end, day in zip(df['Station_start'], df['Station_end'], df['Day']):
    t.add_edge(start, end, int(day))
t # run t in a cell to start the animation
Based on the link you gave, you can also control the speed of the animation by styling the network with pathpy.

Getting same value from list in dataframe column using Python

I have a dataframe with 3 columns. I added one more column, which I want to fill with unique values generated with the random module.
I created a list variable and, using a for loop, appended random strings to it.
After that, I wrote another loop that takes each value from the list and assigns it to the new column.
But the same value ends up in every row.
import random
import string
import pandas as pd
df = pd.read_csv("test.csv")
lst = []
for i in range(20):
    randColumn = ''.join(random.choice(string.ascii_uppercase + string.digits)
                         for i in range(20))
    lst.append(randColumn)
for j in lst:
    df['randColumn'] = j
print(df)
#Output.......
A B C randColumn
0 1 2 3 WHI11NJBNI8BOTMA9RKA
1 4 5 6 WHI11NJBNI8BOTMA9RKA
Could you please help me understand why every row gets the same value from the list, and how to fix it?
If I understood your question correctly, you can use the zip method of the RDD to achieve your goal. (Updated to work correctly with any type of column in df.)
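import random
import string
from pyspark.sql import SparkSession, Row
sparkSession = SparkSession.builder.getOrCreate()
# df is assumed to be a Spark DataFrame with as many rows as lst has entries
lst = []
for i in range(2):
    rand_column = ''.join(random.choice(string.ascii_uppercase + string.digits) for i in range(20))
    # Adding random strings as Row to list
    lst.append(Row(random=rand_column))
# Making rdd from random strings array
random_rdd = sparkSession.sparkContext.parallelize(lst)
# zip requires both RDDs to have the same number of partitions and the same number of elements per partition
res = df.rdd.zip(random_rdd).map(lambda rows: Row(**(rows[0].asDict()), **(rows[1].asDict()))).toDF()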
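Since the question itself uses pandas rather than Spark, here is a minimal pandas-only sketch as well. In the question's second loop, df['randColumn'] = j assigns a single string to the whole column on every iteration, so after the loop every row holds the last string in the list. Assigning a list with one entry per row avoids that. The small dataframe below is a stand-in for test.csv, and random_string is a hypothetical helper name.

```python
import random
import string

import pandas as pd

# Stand-in for pd.read_csv("test.csv"), using the two rows shown in the question.
df = pd.DataFrame({"A": [1, 4], "B": [2, 5], "C": [3, 6]})

alphabet = string.ascii_uppercase + string.digits


def random_string(length=20):
    """Return a random string of uppercase letters and digits."""
    return "".join(random.choice(alphabet) for _ in range(length))


# One random string per row: assigning a list of len(df) values fills the
# column row by row instead of broadcasting a single value to every row.
df["randColumn"] = [random_string() for _ in range(len(df))]
print(df)
```

The strings are random rather than guaranteed unique, although a collision between 20-character strings is extremely unlikely.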

Is Dask appropriate for my goal? compute() is taking very long

I am doing the following in Dask because the df dataframe has 7 million rows and 50 columns, so pandas is extremely slow. However, I might not be using Dask correctly, or Dask might not be appropriate for my goal. I need to do some preprocessing on the df dataframe, which mainly means creating some new columns, and then eventually save the df (I am saving to csv, but I have also tried parquet). Before I save, I believe I have to call compute(), and compute() is taking very long: I left it running for 3 hours and it still wasn't done. I tried to persist() throughout the calculations, but persist() also took a long time. Is this expected with Dask given the size of my data? Could this be because of the number of partitions (I have 20 logical processors and Dask is using 24 partitions; I also have 128 GB of memory, if that helps)? Is there something I could do to speed this up?
import dask.dataframe as dd
import numpy as np
import pandas as pd
from re import match

from dask_ml.preprocessing import LabelEncoder



df1 = dd.read_csv("data1.csv")
df2 = dd.read_csv("data2.csv")
df = df1.merge(df2, how='inner', left_on=['country', 'region'],
               right_on=['country', 'region'])
df['actual_adj'] = (df['actual'] * df['travel'] + 809 * df['stopped']) / (
    df['travel_time'] + df['stopped_time'])
df['c_adj'] = 1 - df['actual_adj'] / df['free']

df['stopped_tom'] = 1 * (df['stopped'] > 0)

def func(df):
    df = df.sort_values('region')
    df['first_established'] = 1 * (df['region_d']==df['region_d'].min())
    df['last_established'] = 1 * (df['region_d']==df['region_d'].max())
    df['actual_established'] = df['noted_timeframe'].shift(1, fill_value=0)
    df['actual_established_2'] = df['noted_timeframe'].shift(-1, fill_value=0)
    df['time_1'] = df['time_book'].shift(1, fill_value=0)
    df['time_2'] = df['time_book'].shift(-1, fill_value=0)
    df['stopped_investing'] = df['stopped'].shift(1, fill_value=1)
    return df

df = df.groupby('country').apply(func).reset_index(drop=True)
df['actual_diff'] = np.abs(df['actual'] - df['actual_book'])
df['length_diff'] = np.abs(df['length'] - df['length_book'])

df['Investment'] = df['lor_index'].values * 1000
df = df.compute().to_csv("path")
Saving to csv or parquet will by default trigger computation, so the last line should be:
df.to_csv("path_*.csv")
The asterisk is needed to specify the pattern of csv file names (each partition is saved into a separate file, unless you specify single_file=True).
My guess is that most of the computation time is spent on this step:
df = df1.merge(df2, how='inner', left_on=['country', 'region'],
               right_on=['country', 'region'])
If one of the dfs is small enough to fit in memory, then it would be good to keep it as a pandas dataframe, see further tips in the documentation.

Pandas Boolean selection updated?

I used to get a single number that told me how many rows are True for either of the conditions in the code. However, since I ran conda update --all, I now get a list of values that are either 0 or 1. I wonder what the simplest method in pandas is to get this done now. I guess this is due to a pandas update, but I did a Google search and could not find that boolean indexing had changed. What is the easiest way to get this sum of booleans? (I know how to get it, but I cannot imagine that an extra step is required.)
import pandas as pd
import numpy as np
x = np.random.randint(10,size=10)
y = np.random.randint(10,size=10)
d ={}
d['x'] = x
d['y'] = y
df = pd.DataFrame(d)
sum([df['x']>=6] or [df['y']<=3])
You need to use the vectorized or operator, |:
(df.x.ge(6) | df.y.le(3)).sum()
# 9
Or: ((df.y <= 3) | (df.x >= 6)).sum(), sum((df.y <= 3) | (df.x >= 6)).
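To see why the original expression returns one value per row, here is a small sketch with fixed, made-up data in place of the random integers. The expression [a] or [b] is ordinary Python: a non-empty list is truthy, so the result is simply [a], and sum() over that one-element list computes 0 + a.

```python
import pandas as pd

# Fixed, made-up values so the result is reproducible.
df = pd.DataFrame({"x": [7, 2, 9, 1, 6], "y": [5, 1, 8, 9, 3]})

# Plain Python `or` between two lists: the first list is non-empty, hence
# truthy, so the whole expression is just [df["x"] >= 6].
python_or = [df["x"] >= 6] or [df["y"] <= 3]

# sum() over a one-element list is 0 + Series, i.e. a Series of 0/1 per row.
per_row = sum(python_or)
print(per_row.tolist())  # [1, 0, 1, 0, 1]

# Element-wise OR of the two conditions, then count the True values.
count = ((df["x"] >= 6) | (df["y"] <= 3)).sum()
print(count)  # 4
```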

There are three problems (load database, loop, and append series)

Unlike when I started, I found this problem to be a more difficult problem than I thought.
I want to refer to a particular column content from the SQLite database, make it into a Series, and then combine it into a single data frame.
I tried the following, but it failed:
import pandas as pd
from pandas import Series, DataFrame
import sqlite3
con = sqlite3.connect("C:/Users/Kun/Documents/Dashin/data.db") #my sqldb
tmplist = ['A003060','A003070'] #db contains that table,I decided to call
#only two for practice.
for i in tmplist:
    tmpSeries =pd.Series([])
    listSeries = pd.read_sql("SELECT * FROM %s " %(i), con , index_col =
    None)['Close'].head(5)
    tmpSeries2 = tmpSeries.append(listSeries)
    print(tmpSeries2)
That code only prints the two series one after the other, like this:
0 7150.0
1 6770.0
2 7450.0
3 7240.0
4 6710.0
dtype: float64
0 14950.0
1 15500.0
2 15000.0
3 14800.0
4 14500.0
What I want is this:
A003060 A003070
0 7150.0 14950.0
1 6770.0 15500.0
2 7450.0 15000.0
3 7240.0 14800.0
4 6710.0 14500.0
I asked a similar question earlier and got an answer, but that question used predefined variables. Here I must use a loop because I have to deal with a series of large databases. I have already tried dataframe.append and transpose(), but failed.
I would appreciate some small hints. Thank you.
To append pandas series using for loop
I think you can create a list, append each Series to it, and finally use concat:
dfs = []
for i in tmplist:
    listSeries = pd.read_sql("SELECT * FROM %s " %(i), con, index_col = None)['Close'].head(5)
    dfs.append(listSeries)
# axis=1 places the Series side by side; keys become the column names
df = pd.concat(dfs, axis=1, keys=tmplist)
print(df)
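For a version that can be run without the original data.db, here is a minimal sketch that builds an in-memory SQLite database from the Close values shown in the question and then applies the same list-plus-concat pattern. The in-memory database and the way the sample tables are created are assumptions made for illustration only.

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory database standing in for data.db.
con = sqlite3.connect(":memory:")

# Close values copied from the output shown in the question.
sample = {
    "A003060": [7150.0, 6770.0, 7450.0, 7240.0, 6710.0],
    "A003070": [14950.0, 15500.0, 15000.0, 14800.0, 14500.0],
}
for table, closes in sample.items():
    pd.DataFrame({"Close": closes}).to_sql(table, con, index=False)

tmplist = list(sample)

series_list = []
for table in tmplist:
    # Table names cannot be bound as SQL parameters, so the name is
    # interpolated; that is only safe because tmplist is a trusted list.
    closes = pd.read_sql('SELECT "Close" FROM "%s"' % table, con)["Close"].head(5)
    series_list.append(closes)

# axis=1 puts the Series side by side; keys become the column names.
df = pd.concat(series_list, axis=1, keys=tmplist)
con.close()
print(df)
```

Collecting the Series in a list and calling concat once after the loop also avoids repeatedly copying a growing object inside the loop.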