I am using Hive on HDInsights/Azure Spark 2.2 Cluster, submitting my queries through Ambari, the data is stored in External tables on Azure Data Lake. Staging and Target tables are partitioned.
I've been working on loading data in Hive today. The flow of data goes from .gz file -> staging table -> target table. It's an incremental load, left join from target to landing to preserve old data and then union all with new data for the full set.
I've noticed some behaviors that seem odd to me, was hoping to gather more insight.
Observation 1: After running the script through, I notice the new data is not present in the staging or the target from the original table/gz file. I wouldn't expect that since there's a UNION ALL present.
Observation 2: I did one step, manually loading data into my staging table from the .gz file/table. I run a simple count(*) on it. It returns 39k, great. I try running a select * where val = XYZ, it returns records, great again. I put a count(*) on that expression, starts returning 0 records.
Apologies if my thoughts are jumbled but wanted to know if there's anybody out there who's experienced similar occurrences and how to overcome them. Let me know any clarifications needed.
Are you sure you don't have spaces in your key ? have you tried trim(val) ?
Observation 2 is really surprising : from the same where predicates, you have rows being returned with a select * but nothing with select(*) ?
Could you include SQL queries and some rows of data ?
Related
I have about 100 tables to which we replicate data, e.g. from the Oracle database.
I would like to quickly check that the data replicated to the tables in db2 is the same as in the source system.
Does anyone have a way to do this? I can create 100 transformations, but that's monotonous and time consuming. I would prefer to process this in a loop.
I thought I would keep the queries in a table and reach into it for records.
I read the data from Table input (sql_db2, sql_source, table_name) and write do copy rows to result. Next I read single record and I read a single record and put it into a loop.
But here came a problem because I don't know how to dynamically compare the data for the tables. Each table has different columns and here I have a problem.
I don't know if this is also possible?
You can inject metadata (in this case your metadata would be the column and table names) to a lot of steps in Pentaho, you create a transformation to collect the metadata to inject to another transformation that has only the steps and some basic information, but the bulk of the information of the columns affected by the different steps is in the transformation injecting the metadata.
Check Pentaho official documentation about Metadata Injection (MDI) and the sample with a basic example of metadata injection available in your PDI installation.
I have a very large data set in GPDB from which I need to extract close to 3.5 million records. I use this for a flatfile which is then used to load to different tables. I use Talend, and do a select * from table using the tgreenpluminput component and feed that to a tfileoutputdelimited. However due to the very large volume of the file, I run out of memory while executing it on the Talend server.
I lack the permissions of a super user and unable to do a \copy to output it to a csv file. I think something like a do while or a tloop with more limited number of rows might work for me. But my table doesnt have any row_id or uid to distinguish the rows.
Please help me with suggestions how to solve this. Appreciate any ideas. Thanks!
If your requirement is to load data into different tables from one table, then you do not need to go for load into file and then from file to table.
There is a component named tGreenplumRow which allows you to write direct sql queries (DDL and DML queries) in it.
Below is a sample job,
If you notice, there are three insert statements inside this component. It will be executed one by one separated by semicolon.
I understand the difference between Internal tables and external tables in hive as below
1) if we drop the internal Table File and metadata will be deleted, however , in case of External only metadata will be
deleted
2) if the file data need to be shared by other tools/applications then we go for external table if not
internal table, so that if we drop the table(external) data will still be available for other tools/applications
I have gone through the answers for question "Difference between Hive internal tables and external tables? "
but still I am not clear about the proper uses cases for Internal Table
so my question is why is that I need to make an Internal table ? why cant I make everything as External table?
Use EXTERNAL tables when:
The data is also used outside of Hive.
For example, the data files are read and processed by an existing program that doesn't lock the files.
The data is permanent i.e used when needed.
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the lifecycle of the table and data.
Let's understand it with two simple scenarios:
Suppose you have a data set, and you have to perform some analytics/problem statements on it. Because of the nature of problem statements, few of them can be done by HiveQL, few of them need Pig Latin and few of them need Map Reduce etc., to get the job done. In this situation External Table comes into picture- the same data set can be used to solve entire analytics instead of having different different copies of same data set for the different different tools. Here Hive don't need authority on the data set because several tools are going to use it.
There can be a scenario, where entire analytics/problem statements can be solved by only HiveQL. In such situation Internal Table comes into picture- Means you can put the entire data set into Hive's Warehouse and Hive is going to have complete authority on the data set.
I am using Spark 1.4. HiveContext is used to connect Hive. I did the following
val hx = new HiveContext(sc)
import hx.implicits._
hx.sql("select * from tab").show
// it is fine, result was shown as expected
then, I inserted a few records into tab from beeline console
hx.refreshTable("tab")
hx.sql("select * from tab").show
// still old records, no newly inserted records
My question is: why the HiveContext didn't retrieve the newly inserted records?
hiveContext.refreshTable(tableName: String) - this will refresh only metadata of the table (not the actual data)
Notes from official documentaition : (credits: https://spark.apache.org)
refreshTable(tableName: String): Unit
Invalidate and refresh all the cached the metadata of the given table. For performance reasons, Spark SQL or the external data source library it uses might cache certain metadata about a table, such as the location of blocks. When those change outside of Spark SQL, users should call this function to invalidate the cache
To retrive newly inserted records:- uncache first and cache again using , uncacheTable(String tableName) and cacheTable(String tableName)
If the target table is partitioned, You need to insert with 'partition' option. If you miss out the partition, data will not be visible.
INSERT OVERWRITE TABLE tablename1 PARTITION (partcol1=val1, partcol2=val2...) SELECT col1,col2,.... FROM tablename2
On a differently slight case, I have an RDD coming from a Spark SQL statement via HiveContext. The solution which worked for me after some experiments was to actually regenerate the RDD itself.
It does not matter whether you are using the DDL by Spark SQL or sending SQL statements directly via hiveContext.sql.
I have seen around people using a "count trick" in order to force the recomputation of a dataset but at least in my attempts I couldn't get to see the new data this way.
Anyway trying caching, refreshing and friends did not work for me, if somebody has some proper pattern here please share.
When I am using the above syntax in "Execute row script" step...it is showing success but the temporary table is not getting created. Plz help me out in this.
Yes, the behavior you're seeing is exactly what I would expect. It works fine from the TSQL prompt, throws no error in the transform, but the table is not there after transform completes.
The problem here is the execution model of PDI transforms. When a transform is run, each step gets its own thread of execution. At startup, any step that needs a DB connection is given its own unique connection. After processing finishes, all steps disconnect from the DB. This includes the connection that defined the temp table. Once that happens (the defining connection goes out of scope), the temp table vanishes.
Note, that this means in a transform (as opposed to a Job), you cannot assume a specific order of completion of anything (without Blocking Steps).
We still don't have many specifics about what you're trying to do with this temp table and how you're using it's data, but I suspect you want its contents to persist outside your transform. In that case, you have some options, but a global temp table like this simply won't work.
Options that come to mind:
Convert temp table to a permanent table. This is the simplest
solution; you're basically making a staging table, loading it with a
Table Output step (or whatever), and then reading it with Table
Input steps in other transforms.
Write table contents to a temp file with something like a Text File
Output or Serialze to File step, then reading it back in from the
other transforms.
Store rows in memory. This involves wrapping your transforms in a
Job, and using the Copy Rows to Results and Get Rows from Results steps.
Each of these approaches has its own pros and cons. For example, storing rows in memory will be faster than writing to disk or network, but memory may be limited.
Another step it sounds like you might need depending on what you're doing is the ETL Metadata Injection step. This step allows you in many cases to dynamically move the metadata from one transform to another. See the docs for descriptions of how each of these work.
If you'd like further assistance here, or I've made a wrong assumption, please edit your question and add as much detail as you can.