I have the following 3 dataframes:
I want to append df_forecast to each of df2_CA and df2_USA using a for-loop. However when I run my code, df_forecast is not appending: df2_CA and df2_USA appear exactly as shown above.
df_list=[df2_CA, df2_USA]
for x in df_list:
pd.concat([x, df_forecast])
x['prod'].fillna(method='ffill',inplace=True)
x['country'].fillna(method='ffill',inplace=True)
x.loc['2020-03':'2020-05', 'rev'] = 200
Desired result:
Should I append df_forecast to each of df2_CA and df_USA or should I should the for-loop to expand the date range without creating df_forecast and if so, how do you code it?
I am new to PySpark and my objective is to use PySpark script in AWS Glue for:
reading a dataframe from input file in Glue => done
changing columns of some rows which satisfy a condition => facing issue
write the updated dataframe on the same schema into S3 => done
The task seems very simple, but I could not find a way to complete it and still facing different different issues with my changing code.
Till now, my code looks like this:
Transform2.printSchema() # input schema after reading
Transform2 = Transform2.toDF()
def updateRow(row):
# my logic to update row based on a global condition
#if row["primaryKey"]=="knownKey": row["otherAttribute"]= None
return row
LocalTransform3 = [] # creating new dataframe from Transform2
for row in Transform2.rdd.collect():
row = row.asDict()
row = updateRow(row)
LocalTransform3.append(row)
print(len(LocalTransform3))
columns = Transform2.columns
Transform3 = spark.createDataFrame(LocalTransform3).toDF(*columns)
print('Transform3 count', Transform3.count())
Transform3.printSchema()
Transform3.show(1,truncate=False)
Transform4 = DynamicFrame.fromDF(Transform3, glueContext, "Transform3")
print('Transform4 count', Transform4.count())
I tried using multiple things like:
using map to update existing rows in a lambda
using collect()
using createDataFrame() to create new dataframe
But faced errors in below steps:
not able to create new updated rdd
not able to create new dataframe from rdd using existing columns
Some errors in Glue I got, at different stages:
ValueError: Some of types cannot be determined after inferring
ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. Traceback (most recent call last):
Any working code snippet or help is appreciated.
from pyspark.sql.functions import col, lit, when
Transform2 = Transform2.toDF()
withKeyMapping = Transform2.withColumn('otherAttribute', when(col("primaryKey") == "knownKey", lit(None)).otherwise(col('otherAttribute')))
This should work for your use-case.
I'm reading a csv file that has 7 columns
df = pd.read_csv('DataSet.csv',delimiter=',',usecols=['Wheel','Date','1ex','2ex','3ex','4ex','5ex'])
The problem is that the model I want to train with it, is complaining about the first 2 columns being Strings, so I want to drop them.
I first tried not to read the from the beginning with :
df = pd.read_csv('DataSet.csv',delimiter=',',usecols=['1ex','2ex','3ex','4ex','5ex'])
but it only shifted the values of two columns..so I decided to drop them.
The problem is that I'm only able to drop the first column 'Date' with
train_df.drop(columns=['Date'], inplace=True)
, train_df is a portion of df uses for testing. How do I go to also drop 'Wheel' column?
I tried
train_df.drop(labels=[["Date","Wheel"]], inplace=True)
but i get KeyError: "[('Date', 'Wheel')] not found in axis"
so I tried
train_df.drop(columns=[["Date","Wheel"]], index=1, inplace=True)
but I still get the same error.
I'm so new to Python I'm out of resources to solve this.
As always many thanks.
Try:
train_df.drop(columns=["Date","Wheel"], index=1, inplace=True)
See the examples in https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
This sounds like it should be REALLY easy to answer with Google but I'm finding it impossible to answer the majority of my nontrivial pandas/pytables questions this way. All I'm trying to do is to load about 3 billion records from about 6000 different CSV files into a single table in a single HDF5 file. It's a simple table, 26 fields, mixture of strings, floats and ints. I'm loading the CSVs with df = pandas.read_csv() and appending them to my hdf5 file with df.to_hdf(). I really don't want to use df.to_hdf(data_columns = True) because it looks like that will take about 20 days versus about 4 days for df.to_hdf(data_columns = False). But apparently when you use df.to_hdf(data_columns = False) you end up with some pile of junk that you can't even recover the table structure from (or so it appears to my uneducated eye). Only the columns that were identified in the min_itemsize list (the 4 string columns) are identifiable in the hdf5 table, the rest are being dumped by data type into values_block_0 through values_block_4:
table = h5file.get_node('/tbl_main/table')
print(table.colnames)
['index', 'values_block_0', 'values_block_1', 'values_block_2', 'values_block_3', 'values_block_4', 'str_col1', 'str_col2', 'str_col3', 'str_col4']
And any query like df = pd.DataFrame.from_records(table.read_where(condition)) fails with error "Exception: Data must be 1-dimensional"
So my questions are: (1) Do I really have to use data_columns = True which takes 5x as long? I was expecting to do a fast load and then index just a few columns after loading the table. (2) What exactly is this pile of garbage I get using data_columns = False? Is it good for anything if I need my table back with query-able columns? Is it good for anything at all?
This is how you can create an HDF5 file from CSV data using pytables. You could also use a similar process to create the HDF5 file with h5py.
Use a loop to read the CSV files with np.genfromtxt into a np array.
After reading the first CSV file, write the data with .create_table() method, referencing the np array created in Step 1.
For additional CSV files, write the data with .append() method, referencing the np array created in Step 1
End of loop
Updated on 6/2/2019 to read a date field (mm/dd/YYY) and convert to datetime object. Note changes to genfromtxt() arguments! Data used is added below the updated code.
import numpy as np
import tables as tb
from datetime import datetime
csv_list = ['SO_56387241_1.csv', 'SO_56387241_2.csv' ]
my_dtype= np.dtype([ ('a',int),('b','S20'),('c',float),('d',float),('e','S20') ])
with tb.open_file('SO_56387241.h5', mode='w') as h5f:
for PATH_csv in csv_list:
csv_data = np.genfromtxt(PATH_csv, names=True, dtype=my_dtype, delimiter=',', encoding=None)
# modify date in fifth field 'e'
for row in csv_data :
datetime_object = datetime.strptime(row['my_date'].decode('UTF-8'), '%m/%d/%Y' )
row['my_date'] = datetime_object
if h5f.__contains__('/CSV_Data') :
dset = h5f.root.CSV_Data
dset.append(csv_data)
else:
dset = h5f.create_table('/','CSV_Data', obj=csv_data)
dset.flush()
h5f.close()
Data for testing:
SO_56387241_1.csv:
my_int,my_str,my_float,my_exp,my_date
0,zero,0.0,0.00E+00,01/01/1980
1,one,1.0,1.00E+00,02/01/1981
2,two,2.0,2.00E+00,03/01/1982
3,three,3.0,3.00E+00,04/01/1983
4,four,4.0,4.00E+00,05/01/1984
5,five,5.0,5.00E+00,06/01/1985
6,six,6.0,6.00E+00,07/01/1986
7,seven,7.0,7.00E+00,08/01/1987
8,eight,8.0,8.00E+00,09/01/1988
9,nine,9.0,9.00E+00,10/01/1989
SO_56387241_2.csv:
my_int,my_str,my_float,my_exp,my_date
10,ten,10.0,1.00E+01,01/01/1990
11,eleven,11.0,1.10E+01,02/01/1991
12,twelve,12.0,1.20E+01,03/01/1992
13,thirteen,13.0,1.30E+01,04/01/1993
14,fourteen,14.0,1.40E+01,04/01/1994
15,fifteen,15.0,1.50E+01,06/01/1995
16,sixteen,16.0,1.60E+01,07/01/1996
17,seventeen,17.0,1.70E+01,08/01/1997
18,eighteen,18.0,1.80E+01,09/01/1998
19,nineteen,19.0,1.90E+01,10/01/1999
I have a DataFrame and I want to display the frequencies for certain values in a certain Series using pd.Series.value_counts().
The problem is that I only see truncated results in the output. I'm coding in Jupyter Notebook.
I have tried unsuccessfully a couple of methods:
df = pd.DataFrame(...) # assume df is a DataFrame with many columns and rows
# 1st method
df.col1.value_counts()
# 2nd method
print(df.col1.value_counts())
# 3rd method
vals = df.col1.value_counts()
vals # neither print(vals) doesn't work
# All output something like this
value1 100000
value2 10000
...
value1000 1
Currently this is what I'm using, but it's quite cumbersome:
print(df.col1.value_counts()[:50])
print(df.col1.value_counts()[50:100])
print(df.col1.value_counts()[100:150])
# etc.
Also, I have read this related Stack Overflow question, but haven't found it helpful.
So how to stop outputting truncated results?
If you want to print all rows:
pd.options.display.max_rows = 1000
print(vals)
If you want to print all rows only once:
with pd.option_context("display.max_rows", 1000):
print(vals)
Relevant documentation here.
I think you need option_context and set to some large number, e.g. 999. Advatage of solution is:
option_context context manager has been exposed through the top-level API, allowing you to execute code with given option values. Option values are restored automatically when you exit the with block.
#temporaly display 999 rows
with pd.option_context('display.max_rows', 999):
print (df.col1.value_counts())