I have an unpartitioned table in BigQuery called "rawdata". That table is becoming quite big, and I would like to partition it. I did not find any way to partition the original table in place, but according to https://cloud.google.com/bigquery/docs/creating-partitioned-tables#restating_data_in_a_partition, I can run a command line that will push data from the unpartitioned "rawdata" table into a partitioned table using a query, but only for a specific day/partition.
My instinct was to use the C# API (we already append data through it) to automate the bq query --replace restatement from the unpartitioned table, but there doesn't seem to be anything in the C# client that can do that. Do you guys have any recommendation on how to proceed? Should I wrap the bq command-line execution instead of using the Google API?
I am not certain which portion of the API you mean, but it sounds like you are referring to the Query API here: https://cloud.google.com/bigquery/docs/reference/v2/jobs/query#examples, which won't allow you to pass in a destination table and truncate/append to it.
The Insert API here: https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert#examples can be used to do what you want by filling in the Configuration.Query part here: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.query
Specifically, you'd want to set the 'configuration.query.destinationTable' field to the table partition you want to populate, and set the 'configuration.query.createDisposition' and 'configuration.query.writeDisposition' fields to match your requirements.
This is effectively what the shell client does when you pass the '--replace' and '--destination_table' parameters.
Finally, you might want to check this thread for cost considerations when creating a partitioned table from a non-partitioned one: Migrating from non-partitioned to Partitioned tables
Based on Ahmed's comments I put together some C# code using the C# API client that can restate data for a partition. Our use case was to remove personal information after it has served its purpose. The code below sets the second field to a blank value. I noticed that you can't use the your_table$20180114 syntax inside the query itself, as you would with the CLI tool, so this is changed to a WHERE clause on _PARTITIONTIME to retrieve the original data from the corresponding partition.
var queryRestateJob = _sut.CreateQueryJob(
        @"SELECT field1, '' as field2, field3 FROM your_dataset.your_table WHERE _PARTITIONTIME = TIMESTAMP('2018-01-14')",
        new List<BigQueryParameter>(),
        new QueryOptions
        {
            CreateDisposition = CreateDisposition.CreateIfNeeded,
            WriteDisposition = WriteDisposition.WriteTruncate,
            // Writing to the partition decorator restates only that day's partition.
            DestinationTable = new TableReference
            {
                DatasetId = "your_dataset",
                ProjectId = "your_project",
                TableId = "your_table$20180114"
            },
            AllowLargeResults = true,
            FlattenResults = false,
            ProjectId = "your_project"
        })
    .PollUntilCompleted();
queryRestateJob = queryRestateJob.ThrowOnAnyError();
I have data in S3 which is partitioned in a YYYY/MM/DD/HH/ structure (not year=YYYY/month=MM/day=DD/hour=HH).
I set up a Glue crawler for this, which creates a table in Athena, but when I query the data in Athena it gives an error because one field has a duplicate name (URL and url, which the SerDe converts to lowercase, causing a name conflict).
To fix this, I manually created another table (using the table definition from SHOW CREATE TABLE), adding 'case.insensitive'=FALSE to the SERDEPROPERTIES:
WITH SERDEPROPERTIES ('paths'='deviceType,emailId,inactiveDuration,pageData,platform,timeStamp,totalTime,userId','case.insensitive'= FALSE)
I then changed the S3 directory structure to the Hive-compatible naming year=/month=/day=/hour=, created the table with 'case.insensitive'=FALSE, and ran the MSCK REPAIR TABLE command for the new table, which loads all the partitions.
(Complete CREATE TABLE QUERY)
But upon querying, I can only see one data column (platform) and the partition columns; the rest of the columns are not parsed. This is despite having copied the Glue-generated CREATE TABLE query and added the case.insensitive=FALSE property.
How can I fix this?
I think you have multiple separate issues: one with the crawler, one with the serde, and one with duplicate keys:
Glue Crawler
If Glue Crawlers delivered on what they promise, they would be a fairly good solution for most situations and would save us from writing the same code over and over again. Unfortunately, if you stray outside of the (undocumented) use cases Glue Crawlers were designed for, you often end up with various issues, from the strange to the completely broken (see for example this question, this question, this question, this question, this question, or this question).
I recommend that you skip Glue Crawler and instead write the table DDL by hand (you have a good template in what the crawler created; it just isn't good enough). Then write a Lambda function (or shell script) that you run on a schedule to add new partitions.
Since your partitioning is only on time, this is a fairly simple script: it just needs to run every once in a while and add the partition for the next period.
It looks like your data is from Kinesis Data Firehose which produces a partitioned structure at hour granularity. Unless you have lots of data coming every hour I recommend you create a table that is only partitioned on date, and run the Lambda function or script once per day to add the next day's partition.
A benefit from not using Glue Crawler is that you don't have to have a one-to-one correspondence between path components and partition keys. You can have a single partition key that is typed as date, and add partitions like this: ALTER TABLE foo ADD PARTITION (dt = '2020-05-13') LOCATION 's3://some-bucket/data/2020/05/13/'. This is convenient because it's much easier to do range queries on a full date than when the components are separate.
If you really need hourly granularity you can either have two partition keys, one which is the date and one the hour, or just the one with the full timestamp, e.g. ALTER TABLE foo ADD PARTITION (ts = '2020-05-13 10:00:00') LOCATION 's3://some-bucket/data/2020/05/13/10/'. Then run the Lambda function or script every hour, adding the next hour's partition.
Having too granular a partitioning doesn't help with performance, and can instead hurt it (although the performance hit comes mostly from the small files and the directories).
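As an illustration, here is a minimal sketch of such a partition-adding script using boto3 (the table name foo, the database, and the bucket are placeholders, not taken from your setup); it adds tomorrow's partition and can be scheduled daily in a Lambda function:

from datetime import date, timedelta

import boto3

athena = boto3.client("athena")

def add_next_days_partition():
    # Pre-create tomorrow's partition so new data is queryable as soon as it lands.
    dt = date.today() + timedelta(days=1)
    location = f"s3://some-bucket/data/{dt:%Y/%m/%d}/"
    athena.start_query_execution(
        QueryString=(
            f"ALTER TABLE foo ADD IF NOT EXISTS "
            f"PARTITION (dt = '{dt:%Y-%m-%d}') LOCATION '{location}'"
        ),
        QueryExecutionContext={"Database": "your_database"},
        ResultConfiguration={"OutputLocation": "s3://some-bucket/athena-results/"},
    )

For hourly granularity the same script would run every hour and format the next hour's timestamp and S3 path instead.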
SerDe config
As for the reason why you're only seeing the value of the platform column: it's the only column whose name and JSON property have the same casing.
It's a bit surprising that the DDL you link to doesn't work, but I can confirm that it really doesn't. I tried creating a table from that DDL, but without the pagedata column (I also skipped the partitioning, but that shouldn't make a difference for the test), and indeed only the platform column had any value when I queried the table.
However, when I removed the case.insensitive serde property it worked as expected, which got me thinking that it might not work the way you think it does. I tried setting it to TRUE instead of FALSE, which made the table work as expected again. I think we can conclude from this that the Athena documentation is just wrong when it says "By default, Athena requires that all keys in your JSON dataset use lowercase". In fact, what happens is that Athena lowercases the column names, but it also lowercases the property names when reading the JSON.
With further experimentation it turned out the 'paths' property was redundant too. This is a table that worked for me:
CREATE EXTERNAL TABLE `json_case_test` (
`devicetype` string,
`timestamp` string,
`totaltime` string,
`inactiveduration` int,
`emailid` string,
`userid` string,
`platform` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-bucket/data/'
I'd say that case.insensitive seems to cause more problems than it solves.
Duplicate keys
When I added the pagedata column (as struct<url:string>) and added "pageData":{"URL":"URL","url":"url"} to the data, I got the error:
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key "url"
And I got the error regardless of whether the pagedata column was involved in the query or not (e.g. SELECT userid FROM json_case_test also errored). I tried the case.insensitive serde property with both TRUE and FALSE, but it had no effect.
Next, I took a look at the source documentation for the serde, which first of all is worded much better, and secondly contains the key piece of information: that you also need to provide mappings for the columns when you turn off case insensitivity.
With the following serde properties I was able to get the duplicate key issue to go away:
WITH SERDEPROPERTIES (
"case.insensitive" = "false",
"mapping.pagedata" = "pageData",
"mapping.pagedata.url" = "pagedata.url",
"mapping.pagedata.url2"= "pagedata.URL"
)
You would have to provide mappings for all the columns except for platform, too.
Alternative: use JSON functions
You mentioned in a comment to this answer that the schema of the pageData property is not constant. This is another case where Glue Crawlers unfortunately don't really work. If you're unlucky you'll end up with a flapping schema that includes some properties some days (see for example this question).
What I realised when I saw your comment is that there is another solution to your problem: set up the table manually (as described above) and use string as the type for the pagedata column. Then you can use functions like JSON_EXTRACT_SCALAR to extract the properties you want during query time.
This solution trades increased complexity of the queries for way fewer headaches trying to keep up with an evolving schema.
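As a rough sketch of that approach (the column and table names follow the test table above; the database and output bucket are placeholders), a query using json_extract_scalar could be submitted like this:

import boto3

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="""
        SELECT userid,
               json_extract_scalar(pagedata, '$.url') AS page_url
        FROM json_case_test
    """,
    QueryExecutionContext={"Database": "your_database"},
    ResultConfiguration={"OutputLocation": "s3://some-bucket/athena-results/"},
)

The interesting part is the SQL itself: with pagedata typed as string, json_extract_scalar pulls out individual properties at query time, so the table schema never has to track the evolving structure.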
As of now, the Hive terminal shows only column headers after a CREATE TABLE statement is run. What settings should I change to make the Hive terminal also show a few rows, say the first 100?
Code I am using to create table t2 from table t1 which resides in the database (I don't know how t1 is created):
create table t2 as
select *
from t1
limit 100;
Now, during development, I am writing select * from t2 limit 100; after each create table section to get the rows with headers.
You cannot
The Hive Create Table documentation does not mention anything about showing records. This, combined with my experience with Hive, makes me quite confident that you cannot achieve this through mere configuration changes.
Of course you could tap into the code of Hive itself, but that is not something to be attempted lightly.
And you should not want to
Changing the create command could lead to all kinds of problems, especially because, unlike the select command, it is in fact an operation on metadata followed by an insert. Neither of these would normally show you anything.
If you created a huge table, it would be problematic to show everything; if you chose to always show just the first 100 rows, that would be inconsistent.
There are ways
Now, there are some things you could do:
Change hive itself (not easy, probably not desirable)
Do it in 2 steps (what you currently do)
Write a wrapper:
If you want to automate things and don't like code duplication, you can look into writing a small wrapper function that runs the create and the select based on just the source, destination, and limit.
This kind of wrapper could be written in bash, Python, or whatever you choose; a sketch follows below.
However, note that if you like executing the commands ad hoc/manually this may not be suitable, as you will need to start a Hive JVM each time you run such a program, so response times will be slow.
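Here is one way it could look in Python, assuming the hive CLI is available on the PATH (the table names and limit are just parameters, nothing here is taken from your environment):

import subprocess

def create_and_preview(source, destination, limit=100):
    # Step 1: create the new table from the source table.
    create = f"CREATE TABLE {destination} AS SELECT * FROM {source} LIMIT {limit};"
    # Step 2: show the first rows of the freshly created table.
    preview = f"SELECT * FROM {destination} LIMIT {limit};"
    subprocess.run(["hive", "-e", create], check=True)
    subprocess.run(["hive", "-e", preview], check=True)

create_and_preview("t1", "t2")

Each call to the hive CLI starts its own JVM, which is exactly the slowness mentioned above.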
All in all you are probably best off just doing the create first and select second.
The command below seems to be the correct way to show the first 100 rows:
select * from <created_table> limit 100;
Pasting the code you have written to create the table would help in diagnosing the issue at hand.
Nevertheless, check whether you have correctly specified the delimiters for the fields, key-value pairs, collection items, etc. while creating the table.
If you have not defined them correctly, you might end up with only the first row (the header) being shown.
Does Google Datalab support using a specific partition of a time-partitioned BigQuery table as the destination table for query results? For example:
from gcp import bigquery as bq
queryString = 'SELECT "a1" AS col1'
tabNam = 'Feature.test$20141228'
bq.Query(queryString).execute(table_name=tabNam, table_mode='create')
My guess is that this feature does not need any special support from Datalab, because all you need to do is supply the table name with the time partition as a suffix (as you have it in your question: Feature.test$20141228). Of course, you first need to make sure that your table (Feature.test) is properly configured with the timePartitioning table property.
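As a hedged illustration of the same idea, using the standard google-cloud-bigquery Python client rather than the datalab wrapper (the project name and write disposition are placeholders, not from your setup), writing query results into one partition looks like this:

from google.cloud import bigquery

client = bigquery.Client(project="your_project")
job_config = bigquery.QueryJobConfig(
    # The $YYYYMMDD suffix addresses a single partition of the table.
    destination=bigquery.TableReference.from_string("your_project.Feature.test$20141228"),
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query('SELECT "a1" AS col1', job_config=job_config).result()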
There is a great feature for tables in Google BigQuery. If you have several tables with the same name plus a date suffix, e.g. test20141228, test20141229, ...
the BigQuery UI will show the whole set as a single entry with a drop-down button, which is really nice.
Then you can use the wildcard function TABLE_DATE_RANGE([Feature.test], date1, date2) to query the tables between testdate1 and testdate2, which is also really nice.
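For reference, a small sketch of such a query run through the standard Python BigQuery client (TABLE_DATE_RANGE is legacy SQL, so legacy SQL is enabled explicitly; the project, table prefix, and dates are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project="your_project")
job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
sql = """
    SELECT col1
    FROM TABLE_DATE_RANGE([Feature.test],
                          TIMESTAMP('2014-12-28'),
                          TIMESTAMP('2014-12-30'))
"""
rows = client.query(sql, job_config=job_config).result()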
I have a column in my database that holds multiple values; here is an example of what those values look like:
The SQL column is called extra_fields and is of type text.
[{"id":"2","value":"Ser mayor de edad"},
{"id":"3","value":"Prueba de extension"},
{"id":"4","value":"99.999.99"},
{"id":"5","value":"10"}}
I have this in a PHP function where I get the values I want to modify:
$db = $this->getDbo();
$query = 'SELECT * FROM #__k2_items where id='.$item_id;
$db->setQuery($query);
$resultado = $db->loadObject();
$campos_extra = json_decode($resultado->extra_fields);
$num_max_cup_descargar = $campos_extra[2]->value;
$num_max_cup_canjear = $campos_extra[3]->value;
$num_max_cup_x_dia = $campos_extra[4]->value;
$num_max_cup_uti_canje_x_user_dia = $campos_extra[5]->value;
$num_max_cup_desc_user = $campos_extra[6]->value;
I am trying to update one of these values; how can I do this?
EDIT: I'm using a MySQL database.
You will not be able to update a substring of some value in a text column with SQL alone. MySQL doesn't care whether you have a stringified JSON object in there or part of a Shakespeare play. You'll have to update this value by doing it in PHP:
$campos_extra = json_decode($resultado->extra_fields);
$num_max_cup_descargar = $campos_extra[2]->value;
$num_max_cup_canjear = $campos_extra[3]->value;
$num_max_cup_x_dia = $campos_extra[4]->value;
$num_max_cup_uti_canje_x_user_dia = $campos_extra[5]->value;
$num_max_cup_desc_user = $campos_extra[6]->value;
$num_max_cup_desc_user = "changedUser";
$campos_extra[6]->value = $num_max_cup_desc_user;
$campos_extra_updated = json_encode($campos_extra);
Then put $campos_extra_updated back into that field of the database. With the Joomla database API your code already uses, that could look roughly like this (adjust to your setup):
// Write the re-encoded JSON back to the extra_fields column (assumes Joomla's database API).
$query = 'UPDATE #__k2_items SET extra_fields = ' . $db->quote($campos_extra_updated) . ' WHERE id = ' . (int) $item_id;
$db->setQuery($query);
$db->execute();
In order to dynamically support such queries, you might use two different approaches:
Programmatically manipulate the data
Well, this is rather simple: to summarize it nicely, it means that you have to write the logic to manipulate the data end to end. The great benefit is the simple fact that your solution will not depend on the underlying DB engine.
Use database-specific extensions
Most current RDBMS engines have some support for dynamic columns or array data stored in tables. In the case of MySQL, there is an open-source clone called MariaDB that has support for such a mechanism; see the documentation of Dynamic Columns in MariaDB.
With PostgreSQL you have the option to store JSON and manipulate it directly (with some limitations), use an array column, use composite types if your properties are well defined, or use hstore to create a schema-less design.
All of the above implementations have their pros and cons, but all of them come with the added benefits of indexing (and, therefore, querying) against those columns, some general SQL support, etc. Choose wisely :-)
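As a rough illustration of the PostgreSQL route, here is a minimal sketch using psycopg2 and jsonb_set; the table and column names are hypothetical, and it assumes extra_fields is stored as a jsonb array like the one in the question:

import psycopg2

conn = psycopg2.connect("dbname=your_db user=your_user")
with conn, conn.cursor() as cur:
    # jsonb_set(target, path, new_value): replace the "value" key of element 5 in place.
    cur.execute(
        """
        UPDATE items
        SET extra_fields = jsonb_set(extra_fields, '{5,value}', '"changedUser"')
        WHERE id = %s
        """,
        (42,),  # placeholder item id
    )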
In a script I'm running the following query on an Oracle database:
select nvl(max(to_char(DATA.VALUE)), 'OK')
from DATA
where DATA.FILTER like (select DATA2.FILTER from DATA2 where DATA2.FILTER2 = 'WYZ')
In the actual script it's a bit more complicated, but you get the idea. ;-)
DATA2.FILTER contains the filter which needs to be applied to DATA as a LIKE clause. The idea is to have it as generic as possible, meaning it should be possible to filter on:
FILTER%
%FILTER
FI%LTER
%FILTER%
FILTER (as if the clause were DATA.FILTER = (select DATA2.FILTER from DATA2 where DATA2.FILTER2 = 'WYZ'))
The system the script runs on does not allow stored procedures for this kind of task, and I also can't make the script build the query dynamically before running it.
Whatever data needs to be fetched, it has to be done with one single query.
I've already tried numerous solutions I found online, but no matter what I do, I seem to be missing the mark.
What am I doing wrong here?
This one?
select nvl(max(to_char(DATA.VALUE)), 'OK')
from DATA
JOIN DATA2 on DATA.FILTER LIKE DATA2.FILTER
where DATA2.FILTER2 = 'WYZ'
Note: the performance of this query is not optimal, because Oracle will always perform a full table scan on table DATA, but it works.