nested pytables

Suppose you are passing a dictionary to the pytable constructor:
h5f.createTable('/','table',{'col1':Float64Col(pos=0),'col2':StringCol(16,pos=1)})
I have the following three beginner's questions related to nested pytables:
1) How do you use a dictionary descriptor for creating a nested pytable?
2) How do you assign positions for the nested columns?
If the top-level column has position pos=1, do you start numbering its subcolumns from 0?
3) How do you assign rows to the nested column?
Thanks for helping!

I've been dynamically creating PyTables descriptions using Python's type(). This should at least get you going...
from tables import *

h5_file = openFile('test_nested_table.h5', 'w')

# Build the nested (sub-table) description as a plain dict of Col instances,
# then turn it into an IsDescription subclass with type().
nested_table_fields = {}
nested_table_fields['x'] = Float64Col(dflt=1, pos=0)
nested_table = type('nested_table', (IsDescription,), nested_table_fields)

# The main description simply embeds the nested description as one of its fields.
main_table_fields = {}
main_table_fields['y'] = Float64Col(dflt=1, pos=0)
main_table_fields['nested_table'] = nested_table
main_table = type('main_table', (IsDescription,), main_table_fields)

h5_table = h5_file.createTable('/', 'nested_table_example', main_table)
print(repr(h5_table))
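For the other two questions (positions and row assignment), PyTables also accepts a nested dictionary directly as the table description, as far as I can tell: a '_v_pos' key gives the position of the nested group itself, the pos= arguments order the columns inside each level starting again from 0, and rows in nested columns are addressed with a '/'-separated path. A minimal, untested sketch along those lines:
from tables import *

h5f = openFile('nested_dict_example.h5', 'w')

# Nested description as a plain dictionary; '_v_pos' positions the nested group.
description = {
    'y': Float64Col(pos=0),
    'nested': {
        '_v_pos': 1,
        'x': Float64Col(pos=0),
        'z': Float64Col(pos=1),
    },
}
table = h5f.createTable('/', 'dict_nested_example', description)

# Rows are assigned to nested columns with a '/'-separated path.
row = table.row
row['y'] = 1.0
row['nested/x'] = 2.0
row['nested/z'] = 3.0
row.append()
table.flush()
h5f.close()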

Related

Convert a spark dataframe to a column

I have an org.apache.spark.sql.DataFrame and I would like to convert it into a column: org.apache.spark.sql.Column.
So basically, this is my dataframe:
val filled_column2 = x.select(first("col1", ignoreNulls = true).over(window))
That is what I want to convert into a Spark SQL column. Could anyone help with that?
Thank you,
@Jaime Caffarel: this is exactly what I am trying to do; this will give you more visibility. You may also check the error message in the second screenshot.
From the documentation of the class org.apache.spark.sql.Column
A column that will be computed based on the data in a DataFrame. A new
column is constructed based on the input columns present in a
dataframe:
df("columnName") // On a specific DataFrame.
col("columnName") // A generic column no yet associcated
with a DataFrame. col("columnName.field") // Extracting a
struct field col("a.column.with.dots") // Escape . in column
names. $"columnName" // Scala short hand for a named
column. expr("a + 1") // A column that is constructed
from a parsed SQL Expression. lit("abc") // A
column that produces a literal (constant) value.
If filled_column2 is a DataFrame, you could do:
filled_column2("col1")
******** EDITED AFTER CLARIFICATION ************
Ok, it seems to me that what you are trying to do is a JOIN operation. Assuming that product_id is a unique key for each row, I would do something like this:
val filled_column = df.select(df("product_id"), last(("last_prev_week_nopromo"), ignoreNulls = true) over window)
This way, you are also selecting the product_id that you will use as key. Then, you can do the following
val promo_txn_cnt_seas_df2 = promo_txn_cnt_seas_df1
.join(filled_column, promo_txn_cnt_seas_df1("product_id") === filled_column("driver_id"), "inner")
// orderBy("product_id", "week")... (the rest of the operations)
Is this what you are trying to achieve?

Pyspark Schema update/alter Dataframe

I need to read a CSV file from S3. It has string and double data, but I will read everything as strings, which provides a dynamic frame of strings only. I want to do the following for each row:
concatenate a few columns and create new columns
add new columns
convert the value in the 3rd column from string to date
convert the values of columns 4, 5 and 6 individually from string to decimal
Storename,code,created_date,performancedata,accumulateddata,maxmontlydata
GHJ 0,GHJ0000001,2020-03-31,0015.5126-,0024.0446-,0017.1811-
MULT,C000000001,2020-03-31,0015.6743-,0024.4533-,0018.0719-
Below is the code that I have written so far:
import re
import datetime
from decimal import Decimal
from pyspark.sql import Row

def ConvertToDec(myString):
    # values look like "0015.5126-", with a trailing '-' marking a negative number
    pattern = re.compile("[0-9]{0,4}[\\.]?[0-9]{0,4}[-]?")
    myString = myString.strip()
    if myString and not pattern.match(myString):
        doubleVal = -9999.9999
    else:
        doubleVal = -Decimal(myString.rstrip('-')) if myString.endswith('-') else Decimal(myString)
    return doubleVal

def rowwise_function(row):
    row_dict = row.asDict()
    data = 'd'
    if not row_dict['code']:
        data = row_dict['code']
    else:
        data = 'CD'
    if not row_dict['performancedata']:
        data = data + row_dict['performancedata']
    else:
        data = data + 'HJ'
    # new columns
    row_dict['LC_CODE'] = data
    row_dict['CD_CD'] = 123
    row_dict['GBL'] = 123.345
    if row_dict["created_date"]:
        row_dict["created_date"] = datetime.datetime.strptime(row_dict["created_date"], '%Y-%m-%d')
    if row_dict["performancedata"]:
        row_dict["performancedata"] = ConvertToDec(row_dict["performancedata"])
    newrow = Row(**row_dict)
    return newrow

store_df = spark.read.option("header","true").csv("C:\\STOREDATA.TXT", sep="|")
ratings_rdd = store_df.rdd
ratings_rdd_new = ratings_rdd.map(lambda row: rowwise_function(row))
updatedDF = spark.createDataFrame(ratings_rdd_new)
Basically, I am creating an almost entirely new DataFrame. My questions are below:
Is this the right approach?
Since I am mostly changing the schema, is there any other approach?
Use Spark DataFrames/SQL; why use an RDD? You don't need to perform any low-level data operations here; everything is column-level, so DataFrames are easier and more efficient to use.
To create new columns, use .withColumn(<col_name>, <expression/value>) (refer)
All the ifs can be expressed with .filter (refer)
The whole ConvertToDec can be written better using strip and the ast module, or float. A rough sketch of this DataFrame-only approach follows below.
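For illustration only, here is an untested sketch of that DataFrame-only approach. It assumes a SparkSession named spark, takes the column names from the sample data above, treats the trailing '-' as a negative sign, and simplifies the LC_CODE logic:
import pyspark.sql.functions as F

df = spark.read.option("header", "true").csv("STOREDATA.TXT", sep="|")

# Replaces the row-wise ConvertToDec: move the trailing '-' to the front, then cast.
def to_decimal(col_name):
    c = F.trim(F.col(col_name))
    return (F.when(c.endswith("-"),
                   F.concat(F.lit("-"), F.regexp_replace(c, "-$", "")))
             .otherwise(c)
             .cast("decimal(10,4)"))

updatedDF = (df
    # new / concatenated columns instead of mutating Row dicts
    .withColumn("LC_CODE",
                F.concat(F.when(F.col("code").isNull() | (F.col("code") == ""), F.col("code"))
                          .otherwise(F.lit("CD")),
                         F.when(F.col("performancedata").isNull() | (F.col("performancedata") == ""), F.col("performancedata"))
                          .otherwise(F.lit("HJ"))))
    .withColumn("CD_CD", F.lit(123))
    .withColumn("GBL", F.lit(123.345))
    # string -> date for the 3rd column
    .withColumn("created_date", F.to_date("created_date", "yyyy-MM-dd"))
    # string -> decimal for columns 4, 5 and 6
    .withColumn("performancedata", to_decimal("performancedata"))
    .withColumn("accumulateddata", to_decimal("accumulateddata"))
    .withColumn("maxmontlydata", to_decimal("maxmontlydata")))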

How to drop multiple column names given in a list from Spark DataFrame?

I have a dynamic list which is created based on the value of n.
n = 3
drop_lst = ['a' + str(i) for i in range(n)]
df.drop(drop_lst)
But the above is not working.
Note:
My use case requires a dynamic list.
If I just do the below without a list, it works:
df.drop('a0','a1','a2')
How do I make drop function work with list?
Spark 2.2 doesn't seem to have this capability. Is there a way to make it work without using select()?
You can use the * operator to pass the contents of your list as arguments to drop():
df.drop(*drop_lst)
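For example, with a toy DataFrame that has the columns from the question (a0, a1, a2) plus one extra column:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3, 4)], ["a0", "a1", "a2", "b"])

n = 3
drop_lst = ['a' + str(i) for i in range(n)]

# *drop_lst unpacks the list into separate arguments, i.e. drop("a0", "a1", "a2")
df.drop(*drop_lst).show()   # only column 'b' remains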
You can give the column names as a comma-separated list, e.g.
df.drop("col1","col11","col21")
This is how to drop a specified number of consecutive columns in Scala:
val ll = dfwide.schema.names.slice(1,5)
dfwide.drop(ll:_*).show
slice takes two parameters: start index and end index.
Use a simple loop:
for c in drop_lst:
    df = df.drop(c)
You can use drop(*cols) in two ways:
df.drop('age').collect()
df.drop(df.age).collect()
Check the official documentation for DataFrame.drop.

convert Int64Index to Int

I'm iterating through a dataframe (called hdf) and applying changes on a row-by-row basis. hdf is sorted by group_id and assigned a rank of 1 through n on some criteria.
# Groupby function creates subset dataframes (a dataframe per distinct group_id).
grouped = hdf.groupby('group_id')

# Iterate through each subdataframe.
for name, group in grouped:
    # This grabs the top index for each subdataframe
    index1 = group[group['group_rank']==1].index
    # If criteria1 == 0, flag all rows for removal
    if(max(group['criteria1']) == 0):
        for x in range(rank1, rank1 + max(group['group_rank'])):
            hdf.loc[x,'remove_row'] = 1
I'm getting the following error:
TypeError: int() argument must be a string or a number, not 'Int64Index'
I get the same error when I try to cast rank1 explicitly:
rank1 = int(group[group['auction_rank']==1].index)
Can someone explain what is happening and provide an alternative?
The answer to your specific question is that index1 is an Int64Index (basically a list), even if it has one element. To get that one element, you can use index1[0].
But there are better ways of accomplishing your goal. If you want to remove all of the rows in the "bad" groups, you can use filter:
hdf = hdf.groupby('group_id').filter(lambda group: group['criteria1'].max() != 0)
If you only want to remove certain rows within matching groups, you can write a function and then use apply:
def filter_group(group):
    if group['criteria1'].max() != 0:
        return group
    else:
        return group.loc[other criteria here]

hdf = hdf.groupby('group_id').apply(filter_group)
(If you really like your current way of doing things, you should know that loc will accept an index, not just an integer, so you could also do hdf.loc[group.index, 'remove_row'] = 1).
Call tolist() on the Int64Index object. Then the list can be iterated over as int values, for instance:
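# index1 is an Int64Index; tolist() turns it into a plain list of ints
for i in index1.tolist():
    hdf.loc[i, 'remove_row'] = 1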
Simply add [0] to ensure getting the first value from the index:
rank1 = int(group[group['auction_rank']==1].index[0])

Define column names of a table using pytables using a for loop inside the class definition

We know that if we need to define the column names of a table using pytables, we can do it in the following way:
class Project(IsDescription):
    alpha = StringCol(20)
    beta = StringCol(20)
    gamma = StringCol(20)
where alpha, beta and gamma are the desired column names of the table.
But suppose I would like to use a list "ColumnNames_list" which contains the column names as follows:
ColumnNames_list[0] = alpha, ColumnNames_list[1] = beta, ColumnNames_list[2] = gamma
Then how should I define the above class "Project"?
I tried with the following:
ColumnNames_list = []
ColumnNames_list[0] = alpha
ColumnNames_list[1] = beta
ColumnNames_list[2] = gamma

class Project(IsDescription):
    for i in range(0, 10):
        ColumnNames_list[i] = StringCol(20)
It's showing the error:
TypeError: Passing an incorrect value to a table column. Expected a Col (or subclass) instance and got: "2". Please make use of the Col(), or descendant, constructor to properly initialize columns.
Define the variable first outside the loop:
ColumnNames_list = []
for i in range(0, 10):
    ColumnNames_list.append(StringCol(20))
The reason I'm using append() rather than ColumnNames_list[i] = StringCol(20) is that you can't assign to an index that doesn't exist: trying to assign ColumnNames_list[1] before it exists throws an IndexError.
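And if the goal is to actually use those names as table columns, note that (as in the nested-table answer near the top of this page) a plain dictionary also works as a table description, so the class isn't strictly needed. A rough sketch, assuming h5f is an already-open PyTables file and 'project' is the table name you want:
ColumnNames_list = ['alpha', 'beta', 'gamma']

# Map each desired column name to a Col instance; pos= fixes the column order.
description = {name: StringCol(20, pos=i)
               for i, name in enumerate(ColumnNames_list)}

table = h5f.createTable('/', 'project', description)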