I have a very large amount of data in my S3 bucket, partitioned by two columns, MODULE and DATE,
such that the file structure of my parquet files is:
s3://my_bucket/path/file.parquet/MODULE='XYZ'/DATE=2020-01-01
I have 7 MODULE values and the DATE ranges from 2020-01-01 to 2020-09-01.
I found a discrepancy in the data and need to correct the MODULE entries for one of the modules. Basically, I need to change all data for a particular index number belonging to MODULE XYZ to MODULE ABC.
I can do this in pyspark by loading the data frame and doing something like:
df=df.withColumn('MODULE', when(col('index')==34, "ABC").otherwise(col('MODULE')))
But how do I repartition it so that only those entries that are changed get moved to the ABC MODULE partition? If I do something like:
df.write.mode('append').partitionBy('MODULE', 'DATE').parquet("s3://my_bucket/path/file.parquet")
I would be adding the data along with the erroneous MODULE data. Plus, I have almost a year's worth of data and don't want to repartition the entire dataset, as it would take a very long time.
Is there a way to do this?
If I understand correctly, you have data in partition MODULE=XYZ that should be moved to MODULE=ABC.
First, identify the impacted files.
from pyspark.sql import functions as F

# Collect the distinct source files that contain the rows to be moved.
impacted = df.where(F.col("index") == 34).select(F.input_file_name()).distinct().collect()
file_list = [row[0] for row in impacted]
Then, create a dataframe based only on these files and use it to rewrite both MODULE partitions.
df = spark.read.parquet(*file_list).withColumn(
    "MODULE", F.when(F.col("index") == 34, "ABC").otherwise(F.col("MODULE"))
)
df.write.parquet(
    "s3://my_bucket/path/file.parquet", mode="append", partitionBy=["MODULE", "DATE"]
)
At this point, ABC should be OK (you just added the missing data), but XYZ is still wrong because of the duplicated rows. To fix XYZ, you just need to delete the files listed in file_list.
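For that deletion step, here is a minimal sketch assuming boto3 is available and that file_list holds the full object URIs returned by input_file_name() (depending on your setup they may be s3:// or s3a:// URIs; only the bucket and key parts matter):
import boto3
from urllib.parse import urlparse

s3 = boto3.resource("s3")

# file_list was collected above and holds one URI per impacted file,
# e.g. s3a://my_bucket/path/file.parquet/MODULE=XYZ/DATE=2020-01-01/part-...
for path in file_list:
    parsed = urlparse(path)
    s3.Object(parsed.netloc, parsed.path.lstrip("/")).delete()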
IIUC, you can do this by filtering the data for that particular index and then saving that data with DATE as the partition column.
df = df.withColumn('MODULE', when(col('index') == 34, "ABC").otherwise(col('MODULE')))
df = df.filter(col('index') == 34)
df.write.mode('overwrite').partitionBy('DATE').parquet("s3://my_bucket/path/ABC/")
This way, you only end up modifying the changed module, i.e. ABC.
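If you want the corrected rows to land inside the original table's layout rather than in a separate path (an assumption about your intent, not part of the answer above), a variant of the write is to append them under the existing MODULE=ABC partition directory and let the DATE partitioning recreate the daily sub-folders:
# df is the filtered dataframe from the two lines above (only the index == 34 rows).
(df.drop('MODULE')                 # the partition value is encoded in the directory name
   .write.mode('append')           # append, so any existing ABC data is preserved
   .partitionBy('DATE')
   .parquet("s3://my_bucket/path/file.parquet/MODULE=ABC/"))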
I tried loading my CSV file using pd.read_csv. It has 33 million records and takes too much time to load and to query.
I have data for 200k customers.
Data loads quickly when using a dask dataframe, but queries take much time.
This is the code I have written for sampling:
df_s = df.sample(frac = 300000/33819106, replace = None, random_state = 10)
This works fine, but the customers have ordered many products. How do I include all of a sampled customer's products in the sample? In other words, how do I sample based on customer id?
Load your data into a dataframe and then sample from it. Output to a new .csv that is easier to read from.
df = pd.read_csv('customers.csv')
df = df.sample(frac=.2) # 20% of the rows will be sampled.
df.to_csv('sample_customers.csv') # Create an easier to work with .csv
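To answer the "sample based on customer id" part of the question: sample the unique customer ids first, then keep every row belonging to those customers. A minimal sketch, assuming the id column is called customer_id (adjust to your actual column name):
import pandas as pd

df = pd.read_csv('customers.csv')

# Sample 20% of the unique customers, then keep all of their rows,
# so every product ordered by a sampled customer stays in the sample.
sampled_ids = df['customer_id'].drop_duplicates().sample(frac=0.2, random_state=10)
df_sample = df[df['customer_id'].isin(sampled_ids)]

df_sample.to_csv('sample_customers.csv', index=False)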
Generally, the format of a question on here is:
Description of problem
Desired outcome
What you've tried
Minimum reproducible example
I have seen some other posts about this, but have not found an answer that permanently works.
We have a table, and I had to add two columns to it. In order to do so, I dropped the table and recreated it. But since it was an external table, it did not drop the associated data files. The data gets loaded from a control file and is partitioned by date. So let's say the dates that were in the table were 2021-01-01 and 2021-01-02. But only 2021-01-02 is in the control file. So when I am loading that date, it gets re-run with the new columns and everything is fine. However, 2021-01-01 is still there with the old schema (without the new columns).
This is no issue in Hive, as it seems to default to resolve by name, not position. But Impala resolves by position, so the new columns throw it off.
If I have a table that before had the columns c1,c2,c3, and now have the additional columns c4,c5, if I try to run a query such as
select * from my_table where c5 is null limit 3;
This will give an incompatible parquet schema error in Impala (but Hive is fine, it would just have null for c4 and c5 for the date 2021-01-01).
If I run the command set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; and then the above query again, it is fine. But I would have to run set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; at the beginning of each session, which is not ideal.
From searching online, I have come up with a few solutions:
Drop all data files when creating the new table and just start loading from scratch (I think we want to keep the old files)
Re-load each date (this might not be ideal as there could be many, many dates that would have to be re-loaded and overwritten)
Change the setting permanently in Cloudera Manager (I do not have access to CM and don't know how feasible it would be to change it)
Are there any other solutions to have it so I don't have to run set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; each time I want to use this table in Impala?
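If the table is queried from Python, one stop-gap is to issue the SET statement automatically at the start of every connection. A minimal sketch, assuming the impyla client and the default Impala daemon port (the host name below is hypothetical); this only automates the per-session workaround, it does not change any cluster-wide default:
from impala.dbapi import connect

# Open a connection and apply the session-level option before any queries run.
conn = connect(host="impala-host.example.com", port=21050)   # hypothetical host
cursor = conn.cursor()
cursor.execute("SET PARQUET_FALLBACK_SCHEMA_RESOLUTION=name")

# Subsequent queries on this cursor resolve Parquet columns by name.
cursor.execute("SELECT * FROM my_table WHERE c5 IS NULL LIMIT 3")
for row in cursor.fetchall():
    print(row)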
Background
I need to design an Airflow pipeline to load CSVs into BigQuery.
I know the CSVs frequently have a changing schema. After loading the first file, the schema might be
id | ps_1 | ps_1_value
when the second file lands and I load it, it might look like
id | ps_1 | ps_1_value | ps_2 | ps_2_value.
Question
What's the best approach to handling this?
My first thought on approaching this would be
Load the second file
Compare the schema against the current table
Update the table, adding two columns (ps_2, ps_2_value)
Insert the new rows
I would do this in a PythonOperator.
If file 3 comes in and looks like id | ps_2 | ps_2_value I would fill in the missing columns and do the insert.
Thanks for the feedback.
After loading two prior files example_data_1.csv and example_data_2.csv I can see that the fields are being inserted into the correct columns, with new columns being added as needed.
Edit: The light bulb moment was realizing that the schema_update_options exist. See here: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.SchemaUpdateOption.html
# Airflow 1.x contrib import path (assumption; adjust for your Airflow version).
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

csv_to_bigquery = GoogleCloudStorageToBigQueryOperator(
    task_id='csv_to_bigquery',
    google_cloud_storage_conn_id='google_cloud_default',
    bucket=airflow_bucket,
    source_objects=['data/example_data_3.csv'],
    skip_leading_rows=1,
    bigquery_conn_id='google_cloud_default',
    destination_project_dataset_table='{}.{}.{}'.format(project, schema, table),
    source_format='CSV',
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_APPEND',
    schema_update_options=['ALLOW_FIELD_RELAXATION', 'ALLOW_FIELD_ADDITION'],
    autodetect=True,
    dag=dag
)
Basically, the recommended pipeline for your case consists of creating a temporary table to process your new data.
Since Airflow is an orchestration tool, it's not recommended to push large flows of data through it.
Given that, your DAG could be very similar to your current DAG:
Load the new file to a temporary table
Compare the main table's schema with the temporary table's schema.
Run a query to move the data from the temporary table to the main table. If the temporary table has new fields, add them to the main table using the parameter schema_update_options (see the sketch after this list). Besides that, if your main table has fields in NULLABLE mode, it will easily handle cases where your new data is missing some fields.
Delete your temporary table
If you're using GCS, move your file to another bucket or directory.
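A minimal sketch of steps 3 and 4 using the google-cloud-bigquery client (the same library whose SchemaUpdateOption is linked in the answer above); the project, dataset, and table names below are placeholders, and a reasonably recent client version is assumed:
from google.cloud import bigquery

client = bigquery.Client()

# Step 3: append everything from the temporary table into the main table,
# letting BigQuery add new columns and relax REQUIRED fields to NULLABLE.
job_config = bigquery.QueryJobConfig(
    destination="my_project.my_dataset.main_table",      # placeholder
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
        bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
    ],
)
client.query(
    "SELECT * FROM `my_project.my_dataset.tmp_table`",   # placeholder temporary table
    job_config=job_config,
).result()

# Step 4: drop the temporary table once the data has been moved.
client.delete_table("my_project.my_dataset.tmp_table", not_found_ok=True)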
Finally, I would like to point out some links that might be useful to you:
AirFlow Documentation (BigQuery's operators)
An article that describes a problem similar to yours, where you can find some of the information mentioned above.
I hope it helps
I have sales data by store and product_category for every week in the following format.
STORE|PRODUCT_CAT|WK_ENDING|<PREDICTOR_VARIABLES>|TARGET_VARIABLE
S1|P1|2016-01-01|..|....
S1|P1|2016-01-08|..|....
S1|P1|2016-01-15|..|....
S1|P2|2016-01-01|..|....
S1|P2|2016-01-08|..|....
S1|P2|2016-01-15|..|....
S2|P1|2016-01-01|..|....
S2|P1|2016-01-08|..|....
S2|P1|2016-01-15|..|....
S2|P2|2016-01-01|..|....
S2|P2|2016-01-08|..|....
S2|P2|2016-01-15|..|....
...
...
As you can see it has multiple records by week for every Store - Product combination.
There could be about 200 different stores and ~50 different product categories i.e. we would have ~200 x ~50 = ~10,000 different Store - product combinations (say). For every such combination we will have data for about 4-5 years i.e. 250 records say.
The requirement is that we run separate regression models for each of the store-product combinations. That means we need to run thousands of regressions, but on very small datasets. What is the best way to go about this?
Options tried / thought about -
1. Usual "FOR" loops -
Extracted the unique Store-category combinations and then for each store and for each cat (nested for loop), filtered the data from the above DF and ran the models.
The process runs for about 10-12 stores and then throws memory errors. Note that the above DF is persisted.
I have seen, for other similar computations, that pySpark is not able to handle for loops well if they have to reference the same DF from inside the loop.
Following is the code snippet -
main_df.persist()  # This is the master dataframe, containing all the above data, that is persisted

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

for store in store_lst:
    # <some calculations like filtering the master dataframe by store etc.>
    main_df_by_store = main_df.filter(main_df['store_id'] == str(store))
    for cat in cat_lst:
        assembler = VectorAssembler(inputCols=['peer_pos_sales'], outputCol='features')
        traindata = main_df_by_store.filter(main_df_by_store['rbt_category'] == str(cat))
        output = assembler.transform(traindata)
        modelfit = output.drop('peer_pos_sales').withColumnRenamed('vacant_pos_sales', 'label')
        lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
        lrModel = lr.fit(modelfit)
        result = lrModel.transform(modelfit)
Can we create a Window Function, partitioned by Store, Category and then apply a UDF to run the regressions?
However, it appears that we can only use built-in functions for Window functions, and not UDF? Is that correct?
How to handle this? Looping is killing the server.
This needs to be done in pySpark only.
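One way to avoid the driver-side loop entirely is a grouped-map pandas UDF: Spark ships each store/category group to a Python worker as a pandas DataFrame, and one model is fitted per group there. A minimal sketch, assuming Spark 3.x (on Spark 2.3/2.4 the equivalent is pandas_udf with PandasUDFType.GROUPED_MAP) and the column names from the snippet above; it fits a simple one-predictor regression with numpy rather than pyspark.ml:
import numpy as np
import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# One output row of fitted coefficients per store/category combination.
result_schema = StructType([
    StructField("store_id", StringType()),
    StructField("rbt_category", StringType()),
    StructField("intercept", DoubleType()),
    StructField("slope", DoubleType()),
])

def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf contains every row of a single store/category combination.
    slope, intercept = np.polyfit(
        pdf["peer_pos_sales"].astype(float),
        pdf["vacant_pos_sales"].astype(float),
        deg=1,
    )
    return pd.DataFrame([{
        "store_id": pdf["store_id"].iloc[0],
        "rbt_category": pdf["rbt_category"].iloc[0],
        "intercept": float(intercept),
        "slope": float(slope),
    }])

results = (
    main_df.groupBy("store_id", "rbt_category")
           .applyInPandas(fit_group, schema=result_schema)
)
Each regression runs on a small pandas DataFrame in parallel across the cluster, so there is no repeated filtering of the persisted master dataframe from the driver.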
I unloaded a bunch of data from an RDBMS by month, and loaded it into Google Cloud Storage (GCS) based on that month. I then read the entire data set into a pyspark data frame on a dataproc cluster, and would like to re-write it to GCS based on the day, rather than the month. I've successfully written to cloud storage where each file only contains a certain date, but have not been able to efficiently name the file or directory based on that date. The code below does what I want it to do, but it is VERY inefficient. I also know I could theoretically get around this by rather using parquet files, but my requirements are to write as CSV. Ultimately I want to load this data into bigquery with a table per day, if there is a solution there that would be easier (and I could then just export each per day table to a file).
# Find distinct dates, and then write based on that.
dates = sqlContext.sql("SELECT distinct THE_DATE FROM tbl")
x = dates.collect()

for d in x:
    date = d.THE_DATE
    single_wk = sqlContext.sql("SELECT * FROM tbl where THE_DATE = '{}'".format(date))
    towrite = single_wk.map(to_csv)
    towrite.coalesce(4).saveAsTextFile('gs://buck_1/AUDIT/{}'.format(date))
So say the data I read in has the dates ['2014-01-01', '2014-01-02', '2014-01-03'] I would want the resulting files / directories to look like this:
gs://buck_1/AUDIT/2014-01-01/part-1
gs://buck_1/AUDIT/2014-01-01/part-2
gs://buck_1/AUDIT/2014-01-01/part-3
gs://buck_1/AUDIT/2014-01-01/part-4
gs://buck_1/AUDIT/2014-01-02/part-1
gs://buck_1/AUDIT/2014-01-02/part-2
gs://buck_1/AUDIT/2014-01-02/part-3
gs://buck_1/AUDIT/2014-01-02/part-4
gs://buck_1/AUDIT/2014-01-03/part-1
gs://buck_1/AUDIT/2014-01-03/part-2
gs://buck_1/AUDIT/2014-01-03/part-3
gs://buck_1/AUDIT/2014-01-03/part-4
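For reference, a much simpler alternative (assuming Spark 2.x, where the DataFrame writer supports CSV natively) is to let Spark create one directory per date in a single job with partitionBy. The directory names come out as THE_DATE=2014-01-01/ rather than 2014-01-01/, and the THE_DATE column itself is not repeated inside the files, so this only fits if that naming is acceptable:
# Single write that produces gs://buck_1/AUDIT/THE_DATE=2014-01-01/part-*, etc.
df = sqlContext.table("tbl")   # the table registered in the question

(df.write
   .partitionBy("THE_DATE")
   .mode("overwrite")
   .csv("gs://buck_1/AUDIT/"))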