I have a set of CSV files, one per year of data, with a YEAR column in each. I want to convert them into a single Parquet dataset, partitioned by year, for later use in pandas. The problem is that a dataframe with all years combined is too large to fit in memory. Is it possible to write the Parquet partitions iteratively, one by one?
I am using fastparquet as the engine.
Here is a simplified code example; it blows up memory usage and crashes.
import pandas as pd

df = []
for year in range(2000, 2020):
    df.append(pd.read_csv(f'{year}.csv'))
df = pd.concat(df)
df.to_parquet('all_years.pq', partition_cols=['YEAR'])
I tried to write the years one by one, like so:
for year in range(2000, 2020):
    df = pd.read_csv(f'{year}.csv')
    df.to_parquet('all_years.pq', partition_cols=['YEAR'])
The data files are all there in their respective YEAR=XXXX directories, but when I try to read the resulting dataset, I only get the last year. Maybe it is possible to fix the Parquet metadata after writing the separate partitions?
I think I found a way to do it using the fastparquet.writer.merge() function. Parquet files are written one by one for each year, leaving out the YEAR column and giving the files appropriate names, and then the merge() function creates the top-level _metadata file.
The code below is just a gist, as I leave out many details from my concrete use case.
import fastparquet
import pandas as pd

years = range(2000, 2020)
for year in years:
    df = pd.read_csv(f'{year}.csv').drop(columns=['YEAR'])
    df.to_parquet(f'all_years.pq/YEAR={year}')
fastparquet.writer.merge([f'all_years.pq/YEAR={y}' for y in years])
df_all = pd.read_parquet('all_years.pq')
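As a usage note, single partitions can then be read back without loading the whole dataset. A minimal sketch, assuming the engine honours the filters argument for this hive-style layout (2005 is just an example year):

import pandas as pd

# Read only the YEAR=2005 partition instead of the full dataset.
df_2005 = pd.read_parquet('all_years.pq', engine='fastparquet',
                          filters=[('YEAR', '==', 2005)])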
Related
This question is related to my previous question at aggregate multiple columns in sql table as json or array.
I am posting some updates/follow-up questions here because I have run into a new problem.
I would like to query a table in a Presto database from PySpark (via Hive) and create a PySpark dataframe from it. I then have to save the dataframe to S3 quickly and read it back from S3 efficiently as Parquet (or any other format, as long as it can be read and written fast).
In order to keep the size as small as possible, I have aggregated some columns into a JSON object.
The original table (> 10^9 rows, some columns (e.g. obj_desc) may have more than 30 English words):
id  cat_name   cat_desc       obj_name  obj_desc     obj_num
1   furniture  living office  desk      4 corners    1.5
1   furniture  living office  chair     4 legs       0.8
1   furniture  restroom       tub       white wide   2.7
1   cloth      fashion        T-shirt   black large  1.1
I have aggregated some of the columns into a JSON object:
from pyspark.sql import functions as F

aggregation_cols = ['cat_name', 'cat_desc', 'obj_name', 'obj_desc', 'obj_num']  # they are all strings
df_temp = (df.withColumn("cat_obj_metadata", F.to_json(F.struct([x for x in aggregation_cols])))
             .drop(*aggregation_cols))
df_temp_agg = df_temp.groupBy('id').agg(F.collect_list('cat_obj_metadata').alias('cat_obj_metadata'))
df_temp_agg.cache()
df_temp_agg.printSchema()
# df_temp_agg.count()  # this takes a very long time and never returns, so I am not sure how large the result is
df_temp_agg = df_temp_agg.repartition(1024)  # not sure what the optimal number of partitions is
df_temp_agg.write.parquet(s3_path, mode='overwrite')  # this runs for a long time (> 12 hours) without returning
I am running on an m4.4xlarge cluster with 4 nodes, and none of the cores look busy.
I also checked the S3 bucket: no folder is created at s3_path.
For other, small dataframes I can see s3_path being created when write.parquet() runs, but for this large dataframe no folders or files are created at s3_path.
Because df_temp_agg.write.parquet() never returns, I am not sure what errors might be happening on the Spark cluster or on S3.
Could anybody help me with this? Thanks.
I cannot share my actual code or data, unfortunately, as it is proprietary, but I can produce an MWE if the problem isn't clear to readers from the text.
I am working with a dataframe containing ~50 million rows, each of which contains a large XML document. From each XML document, I extract a list of statistics relating to the number of occurrences and hierarchical relationships between tags (nothing like undocumented XML formats to brighten one's day). I can express these statistics in dataframes, and I can combine these dataframes over multiple documents using standard operations like GROUP BY/SUM and DISTINCT. The goal is to extract the statistics for all 50 million documents and express them in a single dataframe.
The problem is that I don't know how to efficiently generate 50 million dataframes from each row of one dataframe in Spark, or how to tell Spark to reduce a list of 50 million dataframes to one dataframe using binary operators. Are there standard functions that do these things?
So far, the only workaround I have found is massively inefficient (storing the data as a string, parsing it, doing the computations, and then converting it back into a string). It would take weeks to finish using this method, so it isn't practical.
The extractions and statistics from each row's XML response can be stored in additional columns of that same row. That way Spark can spread the processing across its executors, improving performance.
Here is some pseudocode:
from pyspark.sql import Row
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, DateType, FloatType, ArrayType)

def extract_metrics_from_xml(row):
    j = row['xmlResponse']  # assuming your xml column name is xmlResponse
    # perform your xml extractions and computations for the xmlResponse in python
    ...
    load_date = ...
    stats_data1 = ...
    stats_data2 = ...
    stats_group = ...
    return Row(load_date, stats_data1, stats_data2, stats_group)

schema = StructType([StructField('load_date', DateType()),
                     StructField('stats_data1', FloatType()),
                     StructField('stats_data2', ArrayType(IntegerType())),
                     StructField('stats_group', StringType())])

df_with_xml_stats = original_df.rdd\
    .map(extract_metrics_from_xml)\
    .toDF(schema=schema)\
    .cache()
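Since the stated goal is a single combined statistics dataframe, the per-row results can then be reduced with ordinary DataFrame aggregations. A minimal sketch, assuming the hypothetical column names used above:

from pyspark.sql import functions as F

# Collapse the per-document statistics into one summary dataframe.
df_summary = (df_with_xml_stats
              .groupBy('stats_group')
              .agg(F.sum('stats_data1').alias('stats_data1_total'),
                   F.countDistinct('load_date').alias('distinct_load_dates')))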
I have a pandas dataframe that I've extracted from a json object using pd.json_normalize.
It has 4 rows and over 60 columns, and with the exception of the 'ts' column, no column has more than one value across the rows.
Is it possible to merge the four rows together to give one row which can then be written to a .csv file? I have searched the documentation and found no information on this.
To give context, the data is a one-time record from a weather station. I will have records at 5 minute intervals and need to put all the records into a database for further use.
I've managed to get the desired result. It's a little convoluted, and I would expect that there is a much more succinct way to do it, but I basically manipulated the dataframe, replaced all NaNs with zero, replaced some strings with ints and added the columns together, as shown in the code below:
import json
import pandas as pd

with open(fname, 'r') as d:
    ws = json.loads(next(d))

df = pd.json_normalize(ws['sensors'], record_path='data')
df3 = pd.concat([df.iloc[0], df.iloc[1], df.iloc[2], df.iloc[3]], axis=1)
df3.rename(columns={0: 'a', 1: 'b', 2: 'c', 3: 'd'}, inplace=True)
df3 = df3.fillna(0)
df3.loc['ts', ['b', 'c', 'd']] = 0
df3.loc[['ip_v4_gateway', 'ip_v4_netmask', 'ip_v4_address'], 'c'] = int(0)
df3['comb'] = df3['a'] + df3['b'] + df3['c'] + df3['d']
df3.drop(columns=['a', 'b', 'c', 'd'], inplace=True)
df3 = df3.T
As quite a few people have said, the documentation on this is very patchy, so I hope this may help someone else who is struggling with the same problem!
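For a more succinct route, something along these lines might also work. This is only a sketch, assuming every column holds at most one non-null value across the four rows:

# Take the first non-null value of each column, yielding a single row.
collapsed = df.apply(lambda col: col.dropna().iloc[0] if col.notna().any() else pd.NA)
row = collapsed.to_frame().T  # one-row dataframe, ready for to_csv or a database insert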
How can I make sure that I am able to retain the latest version of each row (based on unique constraints) when using Dask with Parquet files and partition_on?
The most basic use case is that I want to query a database for all rows where updated_at > yesterday and partition the data based on created_at_date (meaning that there can be multiple dates which have been updated, and files for those dates most likely already exist).
└───year=2019
└───month=2019-01
2019-01-01.parquet
2019-01-02.parquet
So I want to be able to combine my new results from the latest query and the old results on disk, and then retain the latest version of each row (id column).
I currently have Airflow operators handling the following logic with Pandas and it achieves my goal. I was hoping to accomplish the same thing with Dask without so much custom code though:
Partition data based on specified columns and save files for each partition (a common example would be using the date or month column to create files such as 2019-01-01.parquet or 2019-12.parquet).
Example:
df_dict = {k: v for k, v in df.groupby(partition_columns)}
Loop through each partition and check if the file name exists. If there is already a file with the same name, read that file as a separate dataframe and concat the two dataframes
Example:
part = df_dict[partition]
part = pd.concat([part, existing], sort=False, ignore_index=True, axis='index')
Sort the dataframes and drop duplicates based on a list of specified columns (the unique constraints, typically sorted by file_modified_timestamp or updated_at, so that the latest version of each row is retained).
Example:
part = part.sort_values(sort_columns, ascending=True).drop_duplicates(unique_constraints, keep='last')
The end result is that my partitioned file (2019-01-01.parquet) has now been updated with the latest values.
I can't think of a way to use the existing parquet methods of a dataframe to do what you are after, but assuming your Dask dataframe is reasonably partitioned, you could do the exact same set of steps within a map_partitions call. This means you pass the constituent pandas dataframes to the function, which acts on them; as long as the data in each partition is non-overlapping, you will be fine. A rough sketch follows.
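This is only a sketch of that idea, assuming one Dask partition per created_at_date value and using hypothetical paths and column names:

import os
import pandas as pd

def upsert_partition(part, root, partition_col, unique_constraints, sort_columns):
    # Runs once per Dask partition, on a plain pandas dataframe.
    if part.empty:
        return part
    key = part[partition_col].iloc[0]
    path = os.path.join(root, f'{key}.parquet')
    if os.path.exists(path):
        # Combine the new rows with what is already on disk.
        existing = pd.read_parquet(path)
        part = pd.concat([part, existing], sort=False, ignore_index=True)
    # Keep only the latest version of each row.
    part = (part.sort_values(sort_columns, ascending=True)
                .drop_duplicates(unique_constraints, keep='last'))
    part.to_parquet(path)
    return part

# ddf is the Dask dataframe holding the freshly queried rows,
# with one partition per created_at_date value.
ddf.map_partitions(upsert_partition, 'output/', 'created_at_date',
                   ['id'], ['updated_at'], meta=ddf).compute()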
I have an S3 bucket with partitioned data underlying Athena. Using Athena, I see there are 104 billion rows in my table. This is about 2 years of data.
Let's call it big_table.
Partitioning is by day and by hour, so 07-12-2018-00, 01, 02, ..., 23 for each day. The Athena field is partition_datetime.
In my use case I need the data from 1 month only, which is about 400 million rows.
So the question has arisen: should I load directly from
1. the files
spark.read.parquet(
    's3://my_bucket/my_schema/my_table_directory/07-01-2018-00/file.snappy.parquet',
    's3://my_bucket/my_schema/my_table_directory/07-01-2018-01/file.snappy.parquet',
    .
    .
    .
    's3://my_bucket/my_schema/my_table_directory/07-31-2018-23/file.snappy.parquet')
or 2. via pyspark using SQL
df = spark.read.parquet('s3://my_bucket/my_schema/my_table_directory')
df.createOrReplaceTempView('tmp')
df = spark.sql("select * from tmp where partition_datetime >= '07-01-2018-00' and partition_datetime < '08-01-2018-00'")
I think #1 is more efficient because we are only bringing in the data for the period in question.
Option 2 seems inefficient to me because all 104 billion rows (or, more accurately, all the partition_datetime values) have to be traversed to satisfy the SELECT. I'm counseled that this really isn't an issue because of lazy execution, and that there is never a dataframe with all 104 billion rows. I still say that at some point each partition must be visited by the SELECT, so option 1 is more efficient.
I am interested in other opinions on this. Please chime in.
What you are saying might be true, but it is not practical, as it will never scale. If you want data for three months, you cannot specify 90 lines of paths in your load command; that is just not a good idea when it comes to big data. You can always perform operations on a dataset that big by using a Spark standalone or YARN cluster.
You could use wildcards in your path to load only files in a given range.
spark.read.parquet('s3://my_bucket/my_schema/my_table_directory/07-{01,02,03}-2018-*/')
or
spark.read.parquet('s3://my_bucket/my_schema/my_table_directory/07-*-2018-*/')
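Either way, you can check how much Spark will actually scan by looking at the physical plan. A minimal sketch, assuming the table is registered in a catalog Spark can see (e.g. Glue), so the partition_datetime partition column is known (the table name here is hypothetical):

df = spark.table('my_schema.my_table')
july = df.where("partition_datetime >= '07-01-2018-00' AND partition_datetime < '08-01-2018-00'")
july.explain()
# A PartitionFilters/PartitionCount entry in the plan indicates that only the matching
# directories are listed and scanned, not the whole two years of data.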
Thom, you are right: #1 is more efficient and the way to do it. However, you can build a collection of the files to read and then ask Spark to read only those files, as sketched below.
This blog might be helpful for your situation.
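A minimal sketch of that approach, with hypothetical dates covering the month in question:

# Build the list of hourly directories for July 2018.
days = [f'07-{d:02d}-2018' for d in range(1, 32)]
hours = [f'{h:02d}' for h in range(24)]
paths = [f's3://my_bucket/my_schema/my_table_directory/{day}-{hour}/'
         for day in days for hour in hours]

# Ask Spark to read only those directories.
df = spark.read.parquet(*paths)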