I enriched a public dataset of Reddit comments with data from LIWC (Linguistic Inquiry and Word Count). I have 60 files of about 600 MB each. The idea is to upload them to BigQuery, combine them, and analyze the results. Alas, I ran into some problems.
For a first test I used a sample with 200 rows and 114 columns. Here is a link to the CSV I used
I first asked on Reddit and fhoffa provided a really good answer. The problem seems to be the newlines (\n) in the body_raw column, as redditors often include them in their text. It seems BigQuery cannot process them.
I tried to load the original data, which I had transferred to storage, back into BigQuery, unedited and untouched, but hit the same problem. BigQuery cannot even process the original data, which comes from BigQuery...?
Anyway, I can open the CSV without problems in other programs such as R, which suggests that the CSV itself is not damaged and the schema is not inconsistent. So fhoffa's command should take care of the problem.
bq load --allow_quoted_newlines --allow_jagged_rows --skip_leading_rows=1 tt.delete_201607a myproject.newtablename gs://my_testbucket/dat2.csv body_raw,score_hidden,archived,name,author,author_flair_text,downs,created_utc,subreddit_id,link_id,parent_id,score,retrieved_on,controversiality,gilded,id,subreddit,ups,distinguished,author_flair_css_class,WC,Analytic,Clout,Authentic,Tone,WPS,Sixltr,Dic,function,pronoun,ppron,i,we,you,shehe,they,ipron,article,prep,auxverb,adverb,conj,negate,verb,adj,compare,interrog,number,quant,affect,posemo,negemo,anx,anger,sad,social,family,friend,female,male,cogproc,insight,cause,discrep,tentat,certain,differ,percept,see,hear,feel,bio,body,health,sexual,ingest,drives,affiliation,achieve,power,reward,risk,focuspast,focuspresent,focusfuture,relativ,motion,space,time,work,leisure,home,money,relig,death,informal,swear,netspeak,assent,nonflu,filler,AllPunc,Period,Comma,Colon,SemiC,QMark,Exclam,Dash,Quote,Apostro,Parenth,OtherP
The output was:
Too many positional args, still have ['body_raw,score_h...]
If I take away "tt.delete_201607a" from the command, I get the same error message I have seen often by now:
BigQuery error in load operation: Error processing job 'xx': Too many errors encountered.
So I do not know what to do here. Should I get rid of the \n characters with Python? That would probably take days (although I'm not sure, as I am not a programmer), since my complete data set is around 55 million rows.
Or do you have any other ideas?

I checked again, and I was able to load the file you left on dropbox without a problem.
First I made sure to download your original file:
Then I ran the following command:
bq load --allow_quoted_newlines --allow_jagged_rows --skip_leading_rows=1 \
tt.delete_201607b dat2.csv\?dl\=0 \
As mentioned on Reddit, you need the following options:
--allow_quoted_newlines: There are newlines inside some strings, hence the CSV is not strictly newline delimited.
--allow_jagged_rows: Not every row has the same number of columns.
,oops: There is an extra column in some rows. I added this column to the list of columns.
When it says "too many positional arguments", it's because your command says:
tt.delete_201607a myproject.newtablename
Well, tt.delete_201607a is how I named my table. myproject.newtablename is how you named your table. Choose one, not both.
Are you sure you are not able to load the sample file you left on Dropbox? Or are you getting errors from rows I can't find in that file?
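If you still want to strip the newlines with Python, as you asked, here is a minimal sketch. It assumes the file is a well-formed, quoted CSV, and it streams row by row so the whole file never has to fit in memory. The function name strip_newlines and the sample data are made up for illustration.

```python
import csv
import io


def strip_newlines(src, dst, replacement=" "):
    """Copy CSV rows from src to dst, replacing newlines embedded in fields."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    rows = 0
    for row in reader:
        cleaned_row = [
            field.replace("\r\n", replacement)
                 .replace("\n", replacement)
                 .replace("\r", replacement)
            for field in row
        ]
        writer.writerow(cleaned_row)
        rows += 1
    return rows


# Tiny in-memory demonstration (hypothetical data).
# For real files, open both with newline="" as the csv module recommends.
sample = 'body_raw,score\n"first line\nsecond line",3\n"plain",1\n'
out = io.StringIO()
count = strip_newlines(io.StringIO(sample, newline=""), out)
cleaned = out.getvalue()
```

Each input file can be processed independently, so the 60 files could be cleaned one after another or in parallel.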


Pyspark - identical dataframe filter operation gives different output

I’m facing a particularly bizarre issue while firing filter queries on a spark dataframe. Here's a screenshot of the filter command I'm trying to run:
As you can see, I'm trying to run the same command multiple times. Each time, it's giving a different number of rows. It is actually meant to return 6 records, but it ends up showing a random number of records every time.
FYI, the underlying data source (from which I'm creating the dataframe) is an Avro file in a Hadoop data lake.
This query only gives me consistent results if I cache the dataframe. But this is not always possible for me, because the dataframe might be very large and caching it would exhaust memory.
What might be the possible reasons for this random behavior? Any advice on how to fix it?
Many thanks :)

Updating Parquet datasets where the schema changes over time

I have a single parquet file that I have been incrementally building every day for several months. The file size is around 1.1GB now and when read into memory it approaches my PC's memory limit. So, I would like to split it up into several files based on the year and month combination (i.e. Data_YYYYMM.parquet.snappy) that will all be in a directory.
My current process reads in the daily csv that I need to append, reads in the historical parquet file with pyarrow and converts to pandas, concats the new and historical data in pandas (pd.concat([df_daily_csv, df_historical_parquet])) and then writes back to a single parquet file. Every few weeks the schema of the data can change (e.g. a new column). With my current method this is not an issue, since the concat in pandas can handle the different schemas and I am overwriting the file each time.
By switching to this new setup I am worried about having inconsistent schemas between months and then being unable to read in data over multiple months. I have tried this already and gotten errors due to non-matching schemas. I thought I might be able to specify this with the schema parameter in pyarrow.parquet.Dataset. From the doc it looks like it takes a type of pyarrow.parquet.Schema. When I try using this I get AttributeError: module 'pyarrow.parquet' has no attribute 'Schema'. I also tried taking the schema of a pyarrow Table (table.schema) and passing that to the schema parameter but got an error message (sorry, I forget the error right now and can't connect to my workstation, so I can't reproduce it; I will update with this info when I can).
I've seen some mention of schema normalization in the context of the broader Arrow/Datasets project, but I'm not sure if my use case fits what that covers. Also, the Datasets feature is experimental, so I don't want to use it in production.
I feel like this is a pretty common use case and I wonder if I am missing something, or if parquet isn't meant for schema changes over time like I'm experiencing. I've considered inspecting the schema of the new file, comparing it against the historical one, and then, if there is a change, deserializing, updating the schema, and reserializing every file in the dataset, but I'm really hoping to avoid that.
So my questions are:
1. Will using a pyarrow parquet Dataset (or something else in the pyarrow API) allow me to read in all of the data in multiple parquet files even if the schema is different? To be specific, my expectation is that the new column would be appended and the values prior to when this column was available would be null. If so, how do you do this?
2. If the answer to 1 is no, is there another method or library for handling this?
Some resources I've been going through.

How to get one file in Hive

I tried a Hive process,
which generates a word frequency ranking from
I would like to output not multiple files but
one file.
I searched for similar questions on this web site,
and I found mapred.reduce.tasks=1,
but it didn't generate one file but 50 files.
The process I tried has 50 input files and
they are all gzip files.
How do I get one merged file?
The 50 input files are so large that I suppose the
reason may be some kind of limit.
In your job, use an ORDER BY clause on some field.
Hive will then be forced to run only one reducer, so you end up with a single file created in HDFS.
hive> Insert into
Select * from default.source
order by id;
For more details regarding the ORDER BY clause, refer to this and this link.
Thank you for your kind answers,
you are really saving me.
I am trying ORDER BY,
but it is taking a long time,
and I am waiting for it.
All I have to do is get one file,
so that the output file can be the input of
the next step.
I am also going to try simply cat-ing all the files from the reducer outputs, as advised.
If I do that, I am worried about whether the files are disjoint, with no word shared between files, and whether a file made by cat-ing multiple gzip files is a normal gzip file.
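On the second worry: the gzip format allows several compressed members back to back, so byte-level concatenation of gzip files is still a valid gzip stream. A small Python check, with made-up word counts standing in for two reducer outputs, illustrates this:

```python
import gzip

# Two independently compressed parts (hypothetical reducer outputs).
part_a = gzip.compress(b"apple\t3\n")
part_b = gzip.compress(b"banana\t5\n")

# Byte-level concatenation, the same thing `cat a.gz b.gz > merged.gz` does.
merged = part_a + part_b

# gzip.decompress handles multi-member data and returns all members joined.
restored = gzip.decompress(merged)
```

This says nothing about whether the same word can appear in more than one reducer output; that depends on how the query distributes keys.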

Validating rows before inserting into BigQuery from Dataflow

According to "How do we set maximum_bad_records when loading a Bigquery table from dataflow?", there is currently no way to set the maxBadRecords configuration when loading data into BigQuery from Dataflow. The suggestion is to validate the rows in the Dataflow job before inserting them into BigQuery.
If I have the TableSchema and a TableRow, how do I go about making sure that the row can safely be inserted into the table?
There must be an easier way of doing this than iterating over the fields in the schema, looking at their type and looking at the class of the value in the row, right? That seems error-prone, and the method must be fool-proof since the whole pipeline fails if a single row cannot be loaded.
My use case is an ETL job that at first will run on JSON (one object per line) logs on Cloud Storage and write to BigQuery in batch, but later will read objects from PubSub and write to BigQuery continuously. The objects contain a lot of information that isn't necessary to have in BigQuery and also contains parts that aren't even possible to describe in a schema (basically free form JSON payloads). Things like timestamps also need to be formatted to work with BigQuery. There will be a few variants of this job running on different inputs and writing to different tables.
In theory it's not a very difficult process, it takes an object, extracts a few properties (50-100), formats some of them and outputs the object to BigQuery. I more or less just loop over a list of property names, extract the value from the source object, look at a config to see if the property should be formatted somehow, apply the formatting if necessary (this could be downcasing, dividing a millisecond timestamp by 1000, extracting the hostname from a URL, etc.), and write the value to a TableRow object.
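To make the loop concrete, here is a rough sketch of it in plain Python rather than the Dataflow Java SDK. The property names, the CONFIG mapping, and the formatter names are all hypothetical; a plain dict stands in for the TableRow.

```python
from urllib.parse import urlparse

# Named formatters, matching the kinds of formatting mentioned above.
FORMATTERS = {
    "downcase": lambda v: v.lower(),
    "ms_to_seconds": lambda v: v / 1000,
    "hostname": lambda v: urlparse(v).hostname,
}

# Hypothetical config: property name -> formatter name (None means copy as-is).
CONFIG = {
    "user": "downcase",
    "ts": "ms_to_seconds",
    "referrer": "hostname",
    "status": None,
}


def to_row(source, config=CONFIG):
    """Extract the configured properties from source and format them."""
    row = {}
    for name, formatter in config.items():
        if name not in source:
            continue  # property absent in this object, leave it out of the row
        value = source[name]
        row[name] = FORMATTERS[formatter](value) if formatter else value
    return row


event = {
    "user": "Alice",
    "ts": 1370033801000,
    "referrer": "https://www.example.com/a?b=1",
    "status": 200,
    "payload": {"free": "form"},  # not in CONFIG, so it is dropped
}
row = to_row(event)
```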
My problem is that data is messy. With a couple of hundred million objects there are some that don't look as expected, it's rare, but with these volumes rare things still happen. Sometimes a property that should contain a string contains an integer, or vice-versa. Sometimes there's an array or an object where there should be a string.
Ideally I would like to take my TableRow and pass it by TableSchema and ask "does this work?".
Since this isn't possible what I do instead is I look at the TableSchema object and try to validate/cast the values myself. If the TableSchema says a property is of type STRING I run value.toString() before adding it to the TableRow. If it's an INTEGER I check that it's a Integer, Long or BigInteger, and so on. The problem with this method is that I'm just guessing what will work in BigQuery. What Java data types will it accept for FLOAT? For TIMESTAMP? I think my validations/casts catch most problems, but there are always exceptions and edge cases.
In my experience, which is very limited, the whole work pipeline (job? workflow? not sure about the correct term) fails if a single row fails BigQuery's validations (just like a regular load does unless maxBadRecords is set to a sufficiently large number). It also fails with superficially helpful messages like 'BigQuery import job "dataflow_job_xxx" failed. Causes: (5db0b2cdab1557e0): BigQuery job "dataflow_job_xxx" in project "xxx" finished with error(s): errorResult: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field'. Perhaps there is somewhere I can see a more detailed error message that could tell me which property it was and what the value was? Without that information it could just as well have said "bad data".
From what I can tell, at least when running in batch mode Dataflow will write the TableRow objects to the staging area in Cloud Storage and then start a load once everything is there. This means that there is nowhere for me to catch any errors, my code is no longer running when BigQuery is loaded. I haven't run any job in streaming mode yet, but I'm not sure how it would be different there, from my (admittedly limited) understanding the basic principle is the same, it's just the batch size that's smaller.
People use Dataflow and BigQuery, so it can't be impossible to make this work without always having to worry about the whole pipeline stopping because of a single bad input. How do people do it?
I'm assuming you deserialize the JSON from the file as a Map<String, Object>. Then you should be able to recursively type-check it with a TableSchema.
I'd recommend an iterative approach to developing your schema validation, with the following two steps.
1. Write a PTransform<Map<String, Object>, TableRow> that converts your JSON rows to TableRow objects. The TableSchema should also be a constructor argument to the function. You can start off making this function really strict -- require that the JSON parser produced an Integer directly, for instance, when a BigQuery INTEGER schema was found -- and aggressively declare records in error. Basically, ensure that no invalid records are output by being super-strict in your handling.
Our code here does something somewhat similar -- given a file produced by BigQuery and written as JSON to GCS, we recursively walk the schema and do some type conversions. However, we do not need to validate, because BigQuery itself wrote the data.
Note that the TableSchema object is not Serializable. We've worked around this by converting the TableSchema in a DoFn or PTransform constructor to a JSON String and back. See the code in that uses the jsonTableSchema variable.
2. Use the "dead-letter" strategy described in this blog post to handle bad records -- side output the offending Map<String, Object> rows from your PTransform and write them to a file. That way, you can inspect the rows that failed your validation later.
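The two steps above can be sketched as follows, in plain Python rather than the Dataflow Java SDK, so this is an illustration of the idea and not our actual code. The schema is assumed to be a list of dicts in the shape of BigQuery's JSON schema (name, type, mode, fields). The strict type rules, the function names check_row and split_rows, and the sample rows are assumptions of this sketch, not a statement of what BigQuery accepts.

```python
# Strict Python-type rules per BigQuery type name (an assumption of this sketch).
STRICT_TYPES = {"STRING": str, "INTEGER": int, "FLOAT": float, "BOOLEAN": bool}


def check_row(row, fields, path=""):
    """Recursively collect error strings for row against a list of schema fields."""
    errors = []
    known = {field["name"] for field in fields}
    for extra in sorted(row.keys() - known):
        errors.append(f"{path}{extra}: not in schema")
    for field in fields:
        name = path + field["name"]
        value = row.get(field["name"])
        if value is None:
            if field.get("mode") == "REQUIRED":
                errors.append(f"{name}: missing required field")
            continue
        if field["type"] == "RECORD":
            if isinstance(value, dict):
                errors.extend(check_row(value, field["fields"], name + "."))
            else:
                errors.append(f"{name}: expected RECORD, got {type(value).__name__}")
            continue
        expected = STRICT_TYPES[field["type"]]
        # bool is a subclass of int in Python, so reject it explicitly for INTEGER.
        is_bool_as_int = expected is int and isinstance(value, bool)
        if not isinstance(value, expected) or is_bool_as_int:
            errors.append(f"{name}: expected {field['type']}, got {type(value).__name__}")
    return errors


def split_rows(rows, fields):
    """Return (good_rows, dead_letters); each dead letter keeps its error list."""
    good, dead = [], []
    for row in rows:
        errors = check_row(row, fields)
        if errors:
            dead.append((row, errors))
        else:
            good.append(row)
    return good, dead


SCHEMA = [
    {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "host", "type": "STRING", "mode": "NULLABLE"},
    {"name": "geo", "type": "RECORD", "mode": "NULLABLE",
     "fields": [{"name": "country", "type": "STRING", "mode": "NULLABLE"}]},
]
rows = [
    {"id": 1, "host": "example.com", "geo": {"country": "SE"}},
    {"id": "2", "host": {"unexpected": "map"}},  # string id, map in a STRING field
    {"host": "example.org"},                     # required id is missing
]
good, dead = split_rows(rows, SCHEMA)
```

In a pipeline, the good rows would go on to the BigQuery sink and the dead letters would be written to a file together with their error lists, so you can see which property failed and why.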
You might start with some small files and use the DirectPipelineRunner rather than the DataflowPipelineRunner. The direct runner runs the pipeline on your computer, rather than on Google Cloud Dataflow service, and it uses the BigQuery streaming writes. I believe when those writes fail you will get better error messages.
(We use the GCS->BigQuery Load Job pattern for Batch jobs because it's much more efficient and cost-effective, but BigQuery streaming writes in Streaming jobs because they are low-latency.)
Finally, in terms of logging information:
Definitely check Cloud Logging (by following the Worker Logs link on the logs panel).
You may get better information about why the load jobs triggered by your Batch Dataflows fail if you run the bq command-line utility: bq show -j PROJECT:dataflow_job_XXXXXXX.

Big query throwing invalid errors on load jobs

I have a regularly scheduled load job that runs every hour and imports data into BigQuery via the JSON data format. This process has been working fine for months; now, all of a sudden, BigQuery has started to throw errors back at me about missing required fields.
Naturally, the first thing I did was review my schema and compare it to one of the JSON files, and all required fields are indeed there. BigQuery doesn't return much information beyond that, and I have checked and re-checked my data 20 times because I'm usually missing something.
Is this a back-end issue? Or perhaps formatting requirements have changed? A perfect example would be JOB # job_2ee5a4be176c421985d7c3eaa84abf4b. It tells me "missing required field(s)", of which there are only 4 in my schema. I checked my JSON for this particular job and they are all there.
Any light shed on this would be tremendously helpful, thanks in advance!!
A sample of the json, only the first 4 fields are required in my schema, and they are all there! I have also double-checked to make sure no extra fields are in the json, and each json is on a new line etc.:
{"date":"2013-05-31 20:56:41","sdate":1370033801,"type":"0","act":"1","cid":"139","chain":"5156","hotel":"21441","template":"default","arrival":"2013-08-04 00:00:00","depart":"2013-08-05 00:00:00","window":"64","nights":"1","total":"0.0000","dailyrate":"0.0000","session":"1530894334","source":"google","keyword":"the carolina hotel chapel hill nc","campaign":"organic","medium":"organic","visits":"2","device":"pc","language":"en-us","ip":"","cookies":"2","base_total":"0.0000","base_rate":"0.0000","batch":"batch_1370045767"}
I am a Google engineer who works on BigQuery. Sorry for the trouble; it appears that you're missing a required RECORD field called currencies.
It appears that the old code was accepting this due to a bug. It was creating empty RECORD fields even if one was not specified in the JSON. As a result, a RECORD field that was REQUIRED could be omitted without triggering an error. However, the correct behavior is to trigger an error, which is what the current code does.
It is unfortunate that the error message does not tell you which required field was missing. This is a TODO in the current version of the code.
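Until the message improves, a pre-load check on your side can name the missing field. Below is a minimal sketch in Python, assuming newline-delimited JSON and a hand-maintained list of REQUIRED top-level field names. The list uses the four required fields from your sample plus currencies; the function name and the contents of the currencies record in the second sample line are hypothetical.

```python
import json

# Top-level fields marked REQUIRED in the schema (maintained by hand here).
REQUIRED = ["date", "sdate", "type", "act", "currencies"]


def missing_required(lines, required=REQUIRED):
    """Yield (line_number, missing_field_names) for each JSON line lacking a field."""
    for number, line in enumerate(lines, start=1):
        record = json.loads(line)
        missing = [name for name in required if record.get(name) is None]
        if missing:
            yield number, missing


sample = [
    '{"date":"2013-05-31 20:56:41","sdate":1370033801,"type":"0","act":"1"}',
    # The contents of "currencies" below are made up for illustration.
    '{"date":"2013-05-31 20:56:41","sdate":1370033801,"type":"0","act":"1",'
    '"currencies":{"code":"USD"}}',
]
report = list(missing_required(sample))
```

This only checks top-level fields; REQUIRED fields nested inside a RECORD would need a recursive walk over the schema.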