dataframe failing to be converted to parquet - pandas

I have been trying to wrap my head around this for two days and haven't been successful, so I'm reaching out to the community.
I have a dataframe with almost a million rows, and when I try to convert it to a Parquet file, I get the following error.
Check failed: (item_builder_->length()) == (key_builder_->length()) keys and items builders don't have the same size in MapBuilder
Looking at the error, I thought that some key whose value is a map was causing it.
So I looked at the likely culprit column; in our case, let's call it geos.
In the schema,
geos: map<string, string>
So I get the error
Check failed: (item_builder_->length()) == (key_builder_->length()) keys and items builders don't have the same size in MapBuilder
when converting the whole df.
I thought that maybe one of the values of geos was not valid. So, to find that row, I tried to convert each row to Parquet individually. Surprisingly, every single row converted fine.
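Roughly, the per-row check looked like this (a sketch; df is the dataframe and the output path is a throwaway placeholder):
import pyarrow as pa
import pyarrow.parquet as pq

# Try each row on its own to see if any single row fails to convert.
for i in range(len(df)):
    row = df.iloc[i:i + 1]
    try:
        pq.write_table(pa.Table.from_pandas(row), "/tmp/row_check.parquet")
    except Exception as exc:
        print(f"row {i} failed: {exc}")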
Now I think this error is due to some column size limit that fastparquet/pyarrow has.
I am using the versions
pyarrow==9.0.0
pandas==1.2.5
Has anyone experienced this before and can validate my reasoning?
Any help reasoning about this error is appreciated.

Related

Pyspark - identical dataframe filter operation gives different output

I'm facing a particularly bizarre issue while running filter queries on a Spark dataframe.
I'm running the same filter command multiple times, and each time it returns a different number of rows. It is actually meant to return 6 records, but it ends up showing a random number of records every time.
FYI, the underlying data source (from which I'm creating the dataframe) is an Avro file in a Hadoop data lake.
The query only gives me consistent results if I cache the dataframe. But that is not always possible for me, because the dataframe might be very large and caching it would choke up memory resources.
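In code, the pattern is roughly this (the column name and filter value are placeholders):
from pyspark.sql.functions import col

# The same filter, run several times in a row, reports a different count each time.
for _ in range(3):
    print(df.filter(col("some_column") == "some_value").count())

# The counts only become stable after df.cache(), which I can't always afford.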
What might be the possible reasons for this random behavior? Any advice on how to fix it?
Many thanks :)

Updating Parquet datasets where the schema changes over time

I have a single Parquet file that I have been building incrementally every day for several months. The file size is around 1.1 GB now, and when read into memory it approaches my PC's memory limit. So I would like to split it up into several files, based on the year and month combination (i.e. Data_YYYYMM.parquet.snappy), that will all live in one directory.
My current process reads in the daily CSV that I need to append, reads in the historical Parquet file with pyarrow and converts it to pandas, concatenates the new and historical data in pandas (pd.concat([df_daily_csv, df_historical_parquet])), and then writes back to a single Parquet file. Every few weeks the schema of the data can change (i.e. a new column). With my current method this is not an issue, since the concat in pandas can handle the different schemas and I overwrite the file each time.
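In rough outline, the daily routine is something like this (paths are placeholders):
import pandas as pd
import pyarrow.parquet as pq

# Read today's CSV and the historical Parquet file, concatenate, and overwrite.
df_daily_csv = pd.read_csv("daily_extract.csv")
df_historical_parquet = pq.read_table("history.parquet.snappy").to_pandas()
combined = pd.concat([df_daily_csv, df_historical_parquet])  # tolerates a new column
combined.to_parquet("history.parquet.snappy", compression="snappy")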
By switching to this new setup I am worried about having inconsistent schemas between months and then being unable to read in data across multiple months. I have already tried this and gotten errors due to non-matching schemas. I thought I might be able to specify this with the schema parameter in pyarrow.parquet.Dataset. From the doc it looks like it takes a pyarrow.parquet.Schema. When I try using this I get AttributeError: module 'pyarrow.parquet' has no attribute 'Schema'. I also tried taking the schema of a pyarrow Table (table.schema) and passing that to the schema parameter, but got an error message (sorry, I forget the exact error right now and can't connect to my workstation to reproduce it - I will update with this info when I can).
I've seen some mention of schema normalization in the context of the broader Arrow/Datasets project, but I'm not sure if my use case fits what that covers, and the Datasets feature is experimental, so I don't want to use it in production.
I feel like this is a pretty common use case, and I wonder if I am missing something or if Parquet just isn't meant for schema changes over time like I'm experiencing. I've considered inspecting the schema of each new file, comparing it against the historical one, and, if there is a change, deserializing, updating the schema, and reserializing every file in the dataset, but I'm really hoping to avoid that.
So my questions are:
Will using a pyarrow parquet Dataset (or something else in the pyarrow API) allow me to read in all of the data across multiple Parquet files even if the schema is different? To be specific, my expectation is that the new column would be appended and the values from before this column was available would be null. If so, how do you do this? (A rough sketch of what I'm hoping for is below.)
If the answer to 1 is no, is there another method or library for handling this?
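What I have in mind for question 1 is something like this sketch (untested; the file pattern is a placeholder, and it assumes the dataset API fills in nulls for columns a given file doesn't have):
import glob
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Gather the monthly files and build one schema that covers all of them.
paths = sorted(glob.glob("Data_*.parquet.snappy"))
unified_schema = pa.unify_schemas([pq.read_schema(p) for p in paths])

# Read everything with the unified schema; older files that lack the new
# column would then (hopefully) come back with nulls for it.
dataset = ds.dataset(paths, schema=unified_schema, format="parquet")
df = dataset.to_table().to_pandas()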
Some resources I've been going through.
https://arrow.apache.org/docs/python/dataset.html
https://issues.apache.org/jira/browse/ARROW-2659
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset

BQ Switching to TIMESTAMP Partitioned Table

I'm attempting to migrate from ingestion-time (_PARTITIONTIME) partitioning to TIMESTAMP-partitioned tables in BQ. In doing so, I also need to add several required columns. However, when I flip the switch and redirect my dataflow to the new TIMESTAMP-partitioned table, it breaks. Things to note:
Approximately two million rows (likely one batch) are successfully inserted. The job continues to run but doesn't insert anything after that.
The job runs in batches.
My project is entirely in Java.
When I run it as streaming, it appears to work as intended. Unfortunately, it's not practical for my use case and batch is required.
I've been investigating the issue for a couple of days and have tried to break the transition down into the smallest steps possible. It appears that the step responsible for the error is introducing REQUIRED fields (it works fine when the same fields are NULLABLE). To avoid any possible parsing errors, I've set default values for all of the REQUIRED fields.
At the moment, I get the following combination of errors and I'm not sure how to address any of them:
The first error repeats infrequently, but usually in groups:
Profiling Agent not found. Profiles will not be available from this worker
Occurs a lot and in large groups:
Can't verify serialized elements of type BoundedSource have well defined equals method. This may produce incorrect results on some PipelineRunner
Appears to be one very large group of these:
Aborting Operations. java.lang.RuntimeException: Unable to read value from state
Towards the end, this error appears every 5 minutes, surrounded only by the mild parsing errors described below:
Processing stuck in step BigQueryIO.Write/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables) for at least 20m00s without outputting or completing in state finish
Due to the sheer volume of data my project parses, there are several parsing errors, such as Unexpected character. They're rare but shouldn't break data insertion. If they do, I have a bigger problem, as the data I collect changes frequently and I can only adjust the parser after I see the error and, therefore, the new data format. Additionally, these errors don't break the ingestion-time table (or my other timestamp-partitioned tables). That being said, here's an example of a parsing error:
Error: Unexpected character (',' (code 44)): was expecting double-quote to start field name
EDIT:
Some relevant sample code:
public PipelineResult streamData() {
    try {
        // Build the pipeline: read from Pub/Sub, window, convert to rows, write to BigQuery.
        GenericSection generic = new GenericSection(options.getBQProject(), options.getBQDataset(), options.getBQTable());
        Pipeline pipeline = Pipeline.create(options);
        pipeline.apply("Read PubSub Events", PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputSubscription()))
                .apply(options.getWindowDuration() + " Windowing", generic.getWindowDuration(options.getWindowDuration()))
                .apply(generic.getPubsubToString())
                .apply(ParDo.of(new CrowdStrikeFunctions.RowBuilder()))
                .apply(new BigQueryBuilder().setBQDest(generic.getBQDest())
                        .setStreaming(options.getStreamingUpload())
                        .setTriggeringFrequency(options.getTriggeringFrequency())
                        .build());
        return pipeline.run();
    }
    catch (Exception e) {
        LOG.error(e.getMessage(), e);
        return null;
    }
}
Writing to BQ. I did try to set the partitioning field here directly, but it didn't seem to affect anything:
BigQueryIO.writeTableRows()
    .to(BQDest)
    .withMethod(Method.FILE_LOADS)
    .withNumFileShards(1000)
    .withTriggeringFrequency(this.triggeringFrequency)
    .withTimePartitioning(new TimePartitioning().setType("DAY"))
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER);
}
After a lot of digging, I found the error. I had parsing logic (a try/catch) that returned nothing (essentially a null row) whenever there was a parsing error. This would break BigQuery, as my schema had several REQUIRED fields.
Since my job ran in batches, even one null row would cause the entire batch job to fail and not insert anything. This also explains why streaming inserted just fine. I'm surprised that BigQuery didn't throw an error claiming that I was attempting to insert a null into a required field.
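The fix itself lives in my Java RowBuilder, but the pattern is simply to drop unparseable records instead of emitting an empty row. A rough Beam Python-SDK sketch of the same idea, for illustration only (parse_to_table_row stands in for my parsing logic):
import apache_beam as beam

class RowBuilder(beam.DoFn):
    def process(self, element):
        try:
            # Emit a row only when parsing succeeds.
            yield parse_to_table_row(element)  # stand-in for the real parser
        except Exception:
            # Drop the record (or route it to a dead-letter output) rather than
            # returning an empty/null row, which violates the REQUIRED fields
            # during the batch load.
            return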
In reaching this conclusion, I also realized that setting the partition field in my code was necessary, as opposed to just in the schema. It can be done on the TimePartitioning object, e.g.
.withTimePartitioning(new TimePartitioning().setType("DAY").setField(partitionField))

Filtering a spark partitioned table is not working in Pyspark

I am using Spark 2.3 and have written one dataframe to create a Hive partitioned table, using the dataframe writer class method in PySpark.
newdf.coalesce(1).write.format('orc').partitionBy('veh_country').mode("overwrite").saveAsTable('emp.partition_Load_table')
Here is my table structure and partition information.
hive> desc emp.partition_Load_table;
OK
veh_code varchar(17)
veh_flag varchar(1)
veh_model smallint
veh_country varchar(3)
# Partition Information
# col_name data_type comment
veh_country varchar(3)
hive> show partitions partition_Load_table;
OK
veh_country=CHN
veh_country=USA
veh_country=RUS
Now I am reading this table back into a dataframe in PySpark.
df2_data = spark.sql("""
SELECT *
from udb.partition_Load_table
""");
df2_data.show() --> is working
But I am not able to filter it using the partition key column:
from pyspark.sql.functions import col
newdf = df2_data.where(col("veh_country")=='CHN')
I am getting the below error message:
: java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive.
You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem,
however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
Caused by: MetaException(message:Filtering is supported only on partition keys of type string)
Whereas, when I create the dataframe by specifying the absolute HDFS path of the table, the filter and where clauses work as expected.
newdataframe = spark.read.format("orc").option("header","false").load("hdfs/path/emp.db/partition_load_table")
The below is working:
newdataframe.where(col("veh_country")=='CHN').show()
My question is: why was it not able to filter the dataframe in the first place, and why does it throw the error message "Filtering is supported only on partition keys of type string" even though my veh_country column is defined as a string/varchar datatype?
I have stumbled on this issue as well. What helped for me was this line:
spark.sql("SET spark.sql.hive.manageFilesourcePartitions=False")
and then using spark.sql(query) instead of the dataframe API.
I do not know what happens under the hood, but this solved my problem.
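In other words, something like this (using the table and column from the question):
spark.sql("SET spark.sql.hive.manageFilesourcePartitions=False")
newdf = spark.sql("""
    SELECT *
    FROM udb.partition_Load_table
    WHERE veh_country = 'CHN'
""")
newdf.show()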
Although it might be too late for you (since this question was asked 8 months ago), this might help other people.
I know the topic is quite old but:
I received the same error, but the actual source of the problem was hidden much deeper in the logs. If you are facing the same problem as me, go to the end of your stack trace and you might find the actual reason the job is failing. In my case:
a. org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:865) ... 142 more, Caused by: MetaException(message:Rate exceeded (Service: AWSGlue; Status Code: 400; Error Code: ThrottlingException ..., which basically means I had exceeded the AWS Glue Data Catalog quota; OR:
b. MetaException(message:1 validation error detected: Value '(<my filtering condition goes here>' at 'expression' failed to satisfy constraint: Member must have length less than or equal to 2048, which means the filtering condition I put in my dataframe definition was too long.
Long story short, dive deep into your logs, because the reason for your error might be really simple; it's just that the top message is far from clear.
If you are working with tables that have a huge number of partitions (in my case, hundreds of thousands), I would strongly recommend against setting spark.sql.hive.manageFilesourcePartitions=False. Yes, it will resolve the issue, but the performance degradation is enormous.

Too many errors [invalid] encountered when loading data into BigQuery

I enriched a public dataset of Reddit comments with data from LIWC (Linguistic Inquiry and Word Count). I have 60 files of about 600 MB each. The idea is now to upload them to BigQuery, bring them together, and analyze the results. Alas, I faced some problems.
For a first test I had a sample with 200 rows and 114 columns. Here is a link to the CSV I used.
I first asked on Reddit and fhoffa provided a really good answer. The problem seems to be the newlines (\n) in the body_raw column, as redditors often include them in their text. It seems BigQuery cannot process them.
I tried to transfer the original data, which I had transferred to storage, back to BigQuery, unedited and untouched, but I hit the same problem. BigQuery cannot even process the original data, which came from BigQuery...?
Anyway, I can open the CSV without problems in other programs such as R, which means that the CSV itself is not damaged and the schema is consistent. So fhoffa's command should get rid of the problem.
bq load --allow_quoted_newlines --allow_jagged_rows --skip_leading_rows=1 tt.delete_201607a myproject.newtablename gs://my_testbucket/dat2.csv body_raw,score_hidden,archived,name,author,author_flair_text,downs,created_utc,subreddit_id,link_id,parent_id,score,retrieved_on,controversiality,gilded,id,subreddit,ups,distinguished,author_flair_css_class,WC,Analytic,Clout,Authentic,Tone,WPS,Sixltr,Dic,function,pronoun,ppron,i,we,you,shehe,they,ipron,article,prep,auxverb,adverb,conj,negate,verb,adj,compare,interrog,number,quant,affect,posemo,negemo,anx,anger,sad,social,family,friend,female,male,cogproc,insight,cause,discrep,tentat,certain,differ,percept,see,hear,feel,bio,body,health,sexual,ingest,drives,affiliation,achieve,power,reward,risk,focuspast,focuspresent,focusfuture,relativ,motion,space,time,work,leisure,home,money,relig,death,informal,swear,netspeak,assent,nonflu,filler,AllPunc,Period,Comma,Colon,SemiC,QMark,Exclam,Dash,Quote,Apostro,Parenth,OtherP
The output was:
Too many positional args, still have ['body_raw,score_h...]
If I take away "tt.delete_201607a" from the command, I get the same error message I have now seen often:
BigQuery error in load operation: Error processing job 'xx': Too many errors encountered.
So I do not know what to do here. Should I get rid of the \n characters with Python? That would probably take days (although I'm not sure, I am not a programmer), as my complete data set is around 55 million rows.
Or do you have any other ideas?
I checked again, and I was able to load the file you left on Dropbox without a problem.
First I made sure to download your original file:
wget https://www.dropbox.com/s/5eqrit7mx9sp3vh/dat2.csv?dl=0
Then I ran the following command:
bq load --allow_quoted_newlines --allow_jagged_rows --skip_leading_rows=1 \
tt.delete_201607b dat2.csv\?dl\=0 \
body_raw,score_hidden,archived,name,author,author_flair_text,downs,created_utc,subreddit_id,link_id,parent_id,score,retrieved_on,controversiality,gilded,id,subreddit,ups,distinguished,author_flair_css_class,WC,Analytic,Clout,Authentic,Tone,WPS,Sixltr,Dic,function,pronoun,ppron,i,we,you,shehe,they,ipron,article,prep,auxverb,adverb,conj,negate,verb,adj,compare,interrog,number,quant,affect,posemo,negemo,anx,anger,sad,social,family,friend,female,male,cogproc,insight,cause,discrep,tentat,certain,differ,percept,see,hear,feel,bio,body,health,sexual,ingest,drives,affiliation,achieve,power,reward,risk,focuspast,focuspresent,focusfuture,relativ,motion,space,time,work,leisure,home,money,relig,death,informal,swear,netspeak,assent,nonflu,filler,AllPunc,Period,Comma,Colon,SemiC,QMark,Exclam,Dash,Quote,Apostro,Parenth,OtherP,oops
As mentioned on Reddit, you need the following options:
--allow_quoted_newlines: There are newlines inside some strings, hence the CSV is not strictly newline delimited.
--allow_jagged_rows: Not every row has the same number of columns.
,oops: There is an extra column in some rows. I added this column to the list of columns.
When it says "too many positional arguments", it's because your command says:
tt.delete_201607a myproject.newtablename
Well, tt.delete_201607a is how I named my table. myproject.newtablename is how you named your table. Choose one, not both.
Are you sure you are not able to load the sample file you left on Dropbox? Or are you getting errors from rows I can't find in that file?