PySpark schema update/alter DataFrame

I need to read a CSV file from S3. It contains string and double data, but I will read everything as strings, which gives a dynamic frame of strings only. For each row I want to:
concatenate a few columns and create new columns
add new columns
convert the value in the 3rd column from string to date
convert the values of columns 4, 5 and 6 individually from string to decimal
Storename,code,created_date,performancedata,accumulateddata,maxmontlydata
GHJ 0,GHJ0000001,2020-03-31,0015.5126-,0024.0446-,0017.1811-
MULT,C000000001,2020-03-31,0015.6743-,0024.4533-,0018.0719-
Below is the code that I have written so far
import re
from decimal import Decimal

def ConvertToDec(myString):
    # values look like "0015.5126-": up to 4 digits, optional ".", up to 4 digits, optional trailing "-"
    pattern = re.compile(r"[0-9]{0,4}\.?[0-9]{0,4}-?")
    myString = myString.strip()
    if myString and not pattern.fullmatch(myString):
        # sentinel for values that do not look numeric at all
        return Decimal("-9999.9999")
    # a trailing "-" marks a negative value
    if myString.endswith("-"):
        return -Decimal(myString[:-1])
    return Decimal(myString)
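As a quick sanity check, the cleaned-up helper behaves like this on the sample values above (assuming, as the data suggests, that the trailing "-" marks a negative number):

print(ConvertToDec("0015.5126-"))  # Decimal('-15.5126')
print(ConvertToDec("garbage"))     # Decimal('-9999.9999'), the sentinel value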
import datetime
from pyspark.sql import Row

def rowwise_function(row):
    row_dict = row.asDict()
    data = 'd'
    if not row_dict['code']:
        data = row_dict['code']
    else:
        data = 'CD'
    if not row_dict['performancedata']:
        data = data + row_dict['performancedata']
    else:
        data = data + 'HJ'
    # new columns
    row_dict['LC_CODE'] = data
    row_dict['CD_CD'] = 123
    row_dict['GBL'] = 123.345
    if row_dict['created_date']:
        row_dict['created_date'] = datetime.datetime.strptime(row_dict['created_date'], '%Y-%m-%d')
    if row_dict['performancedata']:
        row_dict['performancedata'] = ConvertToDec(row_dict['performancedata'])
    newrow = Row(**row_dict)
    return newrow
store_df = spark.read.option("header","true").csv("C:\\STOREDATA.TXT", sep="|")
ratings_rdd = store_df.rdd
ratings_rdd_new = ratings_rdd.map(lambda row: rowwise_function(row))
updatedDF=spark.createDataFrame(ratings_rdd_new)
Basically, I am creating an almost entirely new DataFrame. My questions are:
Is this the right approach?
Since I am mostly changing the schema, is there any other approach?

Use Spark DataFrames/SQL; why use RDDs? You don't need to perform any low-level data operations here, everything is column-level, so DataFrames are easier and more efficient to use.
To create new columns - .withColumn(<col_name>, <expression/value>) (refer)
The ifs can be expressed with when/otherwise (or .filter where you only need to keep matching rows) (refer)
The whole ConvertToDec can be written better using strip and the ast module or float. A rough column-level sketch follows.
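For illustration, a minimal sketch of that column-level version, assuming the column names from the sample data above and that a trailing "-" marks a negative value; the LC_CODE concatenation rule is simplified and the to_dec helper is a stand-in to adapt:

from decimal import Decimal
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

store_df = spark.read.option("header", "true").csv("C:\\STOREDATA.TXT", sep="|")

def to_dec(s):
    # assumption: "0015.5126-" means -15.5126; None stays None
    if s is None:
        return None
    s = s.strip()
    return -Decimal(s[:-1]) if s.endswith("-") else Decimal(s)

to_dec_udf = F.udf(to_dec, DecimalType(10, 4))

out_df = (store_df
    .withColumn("LC_CODE", F.concat(F.coalesce(F.col("code"), F.lit("CD")), F.lit("HJ")))
    .withColumn("CD_CD", F.lit(123))
    .withColumn("GBL", F.lit(123.345))
    .withColumn("created_date", F.to_date("created_date", "yyyy-MM-dd"))
    .withColumn("performancedata", to_dec_udf("performancedata"))
    .withColumn("accumulateddata", to_dec_udf("accumulateddata"))
    .withColumn("maxmontlydata", to_dec_udf("maxmontlydata")))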

Related

Convert a spark dataframe to a column

I have an org.apache.spark.sql.DataFrame and I would like to convert it into a column: org.apache.spark.sql.Column.
So basically, this is my dataframe:
val filled_column2 = x.select(first(("col1"), ignoreNulls = true).over(window))
This is what I want to convert into a Spark SQL column. Could anyone help with that?
Thank you,
@Jaime Caffarel: this is exactly what I am trying to do; this should give you more visibility. You may also check the error message in the 2nd screenshot.
From the documentation of the class org.apache.spark.sql.Column
A column that will be computed based on the data in a DataFrame. A new column is constructed based on the input columns present in a DataFrame:

df("columnName")          // On a specific DataFrame.
col("columnName")         // A generic column not yet associated with a DataFrame.
col("columnName.field")   // Extracting a struct field.
col("a.column.with.dots") // Escape `.` in column names.
$"columnName"             // Scala short hand for a named column.
expr("a + 1")             // A column that is constructed from a parsed SQL Expression.
lit("abc")                // A column that produces a literal (constant) value.
If filled_column2 is a DataFrame, you could do:
filled_column2("col1")
******** EDITED AFTER CLARIFICATION ************
Ok, it seems to me that what you are trying to do is a JOIN operation. Assuming that the product_id is a unique key per each row, I would do something like this:
val filled_column = df.select(df("product_id"), last(("last_prev_week_nopromo"), ignoreNulls = true) over window)
This way, you are also selecting the product_id that you will use as key. Then, you can do the following
val promo_txn_cnt_seas_df2 = promo_txn_cnt_seas_df1
  .join(filled_column, promo_txn_cnt_seas_df1("product_id") === filled_column("product_id"), "inner")
  // .orderBy("product_id", "week")... (the rest of the operations)
Is this what you are trying to achieve?

How to truncate a table in PySpark?

In one of my projects, I need to check if an input dataframe is empty or not. If it is not empty, I need to do a bunch of operations and load some results into a table and overwrite the old data there.
On the other hand, if the input dataframe is empty, I do nothing and simply need to truncate the old data in the table. I know how to insert data with overwrite, but I don't know how to only truncate the table. I searched existing questions/answers and found no clear answer.
driver = 'com.microsoft.sqlserver.jdbc.SQLServerDriver'
stage_url = 'jdbc:sqlserver://server_name\DEV:51433;databaseName=project_stage;user=xxxxx;password=xxxxxxx'
if input_df.count() > 0:
    # Do something here to generate result_df
    print(" write to table ")
    write_dbtable = 'Project_Stage.StageBase.result_table'
    write_df = result_df
    write_df.write.format('jdbc').option('url', stage_url).option('dbtable', write_dbtable). \
        option('truncate', 'true').mode('overwrite').option('driver', driver).save()
else:
    print('no account to process!')
    query = """TRUNCATE TABLE Project_Stage.StageBase.result_table"""
    ### Not sure how to run the query
Truncating is probably easiest done like this:
write_df = write_df.limit(0)
Also, for better performance, instead of input_df.count() > 0 you should use:
Spark 3.2 and below: len(input_df.head(1)) > 0
Spark 3.3+: not input_df.isEmpty()
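Putting the pieces together, a sketch of how the branch could look (stage_url, driver and the table name come from the question; result_df generation is elided, Spark 3.3+ is assumed for isEmpty, and input_df's columns are assumed to line up with the target table, otherwise build the empty frame from result_df's schema):

if not input_df.isEmpty():
    # ... generate result_df and write it with mode('overwrite') as in the question ...
    pass
else:
    print('no account to process!')
    # Overwriting with an empty frame plus truncate=true clears the rows but keeps the table
    (input_df.limit(0).write.format('jdbc')
        .option('url', stage_url)
        .option('dbtable', 'Project_Stage.StageBase.result_table')
        .option('truncate', 'true')
        .mode('overwrite')
        .option('driver', driver)
        .save())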

combining CSV files from Covid-data

I want to combine the CSV files from the Johns Hopkins Covid Data (e.g. https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/05-10-2020.csv & https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-23-2020.csv).
I already managed to load the files into a DataFrame as well as sanitizing the header (_ vs. / in some names). Now I want to pick one column (e.g. Confirmed), rename it to the day of the file and then combine those CSV files to get a progress over time.
This merge needs to be done by state_province. The key may be missing from either frame. How can I do this? I experimented with rightjoin and outerjoin, but didn't have any success. Can someone point me in the right direction, please?
I initially didn't want to share the code that I have so far because I didn't want to steer toward a specific solution, but here it is. It is copied together from several Jupyter cells.
using Dates
start = Dates.Date(2020,1,22) #begin of recording
now = Dates.Date(Dates.now())- Dates.Day(1) #today
date_range = collect(start:Dates.Day(1):now) #create a date range with 1 element per day
prefix = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"
suffix = ".csv"
function create_url(date)
    return prefix * Dates.format(date, "mm-dd-YYYY") * suffix
end
function cleanup_column_names(name)
    if name == "Country/Region" || name == "Country_Region"
        return "country"
    elseif name == "Province/State" || name == "Province_State"
        return "state"
    else
        return name
    end
end
using CSV
using HTTP
using DataFrames
selected_data = "Confirmed"
date = date_range[1]
data = DataFrame(CSV.File(HTTP.get(create_url(date)).body))
DataFrames.rename!(cleanup_column_names, data)
DataFrames.select!(data,["state", "country", selected_data])
DataFrames.rename!(data, 3 => Dates.format(date, "YYYY-mm-dd"))
Regards
Tobias
I am relatively new to Julia, so take my answer with a bit of scepticism:
First, we wrap the DataFrame creation into a function:
function prepare_date_df(date)
    data = DataFrame(CSV.File(HTTP.get(create_url(date)).body))
    DataFrames.rename!(cleanup_column_names, data)
    DataFrames.select!(data, ["state", "country", selected_data])
    DataFrames.rename!(data, 3 => Dates.format(date, "YYYY-mm-dd"))
    return data
end
Let's create our first DataFrame:
df = prepare_date_df(date_range[1])
Now, let's iterate over all the other dates, create a dataframe for each date and merge this with our first dataframe:
for date in date_range[2:end]
    df_new = prepare_date_df(date)
    df = outerjoin(df, df_new, on = [:state, :country])
end
This works fine for the first two months, but with the growing DataFrames it suddenly gets very slow (and even hangs?). So I would be very interested in a more performant answer!

Using to_datetime on several column names

I am working with several CSVs whose first N columns are general information and whose next M columns (M is large) hold information for specific dates.
This is the dataframe picture
I need to convert just the column names from N+1 to N+M-1 to date format.
I tried this. In this case N+1 = 5, and no matter what M is, I suppose I can use -1 so the last column name is not affected.
ContDiarios.columns[5:-1] = pd.to_datetime(ContDiarios.columns[5:-1])
but I get the following error:
TypeError: Index does not support mutable operations
The way you are doing it is not feasible. Please try this instead:
def convert(x):
    try:
        return pd.to_datetime(x)
    except:
        return x

# list(...) gives pandas a concrete sequence of labels
x.columns = list(map(convert, x.columns))
Or you can also use the df.rename method to convert it, as sketched below.
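For example, a minimal sketch of the rename route that touches only the column range from the question (assuming the DataFrame is called ContDiarios and N+1 = 5; the last column is left untouched):

import pandas as pd

date_cols = ContDiarios.columns[5:-1]
mapping = {c: pd.to_datetime(c) for c in date_cols}   # label -> Timestamp
ContDiarios = ContDiarios.rename(columns=mapping)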

How do I reverse each value in a column bit wise for a hex number?

I have a dataframe which has a column called hexa which has hex values like this. They are of dtype object.
hexa
0 00802259AA8D6204
1 00802259AA7F4504
2 00802259AA8D5A04
I would like to remove the first and last byte and reverse the order of the remaining bytes, as follows:
hexa-rev
0 628DAA592280
1 457FAA592280
2 5A8DAA592280
Please help
I'll show you the complete solution up here and then explain its parts below:
def reverse_bits(bits):
    trimmed_bits = bits[2:-2]
    list_of_bits = [i+j for i, j in zip(trimmed_bits[::2], trimmed_bits[1::2])]
    reversed_bits = [list_of_bits[-i] for i in range(1, len(list_of_bits)+1)]
    return ''.join(reversed_bits)
df['hexa-rev'] = df['hexa'].apply(lambda x: reverse_bits(x))
There are possibly a couple of ways of doing it, but this way should solve your problem. The general strategy is to define a function and then use the apply() method to apply it to all values in the column. It should look something like this:
df['hexa-rev'] = df['hexa'].apply(lambda x: reverse_bits(x))
Now we need to define the function we're going to apply. Breaking it down into its parts, we strip the first and last byte (two hex characters each) by slicing. Because of how negative indexes work, this removes them regardless of the string's length. The result is a string of characters that we will process and then join back together.
def reverse_bits(bits):
    trimmed_bits = bits[2:-2]
The second line pairs up the remaining characters two at a time and concatenates each pair into a single two-character string representing one byte.
def reverse_bits(bits):
    trimmed_bits = bits[2:-2]
    list_of_bits = [i+j for i, j in zip(trimmed_bits[::2], trimmed_bits[1::2])]
The second-to-last line builds that list in reverse order. Lastly, the function joins it back together and returns a single hex string.
def reverse_bits(bits):
    trimmed_bits = bits[2:-2]
    list_of_bits = [i+j for i, j in zip(trimmed_bits[::2], trimmed_bits[1::2])]
    reversed_bits = [list_of_bits[-i] for i in range(1, len(list_of_bits)+1)]
    return ''.join(reversed_bits)
I explained it in reverse order, but you want to define the function you want applied to your column first, and then use apply() to make it happen.
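As a compact cross-check, the same operation can be written with bytes.fromhex (a sketch assuming each pair of hex digits is one byte and that the first and last byte should be dropped; reverse_bytes is just an illustrative name):

def reverse_bytes(hex_str):
    # drop the first and last byte, reverse the rest, re-encode as upper-case hex
    return bytes.fromhex(hex_str)[1:-1][::-1].hex().upper()

print(reverse_bytes("00802259AA8D6204"))  # 628DAA592280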