Saving Scala SQL Output as DataFrame

I have the following script to run a SQL query:
val df_joined_sales_partyid = spark.sql(""" SELECT a.sales_transaction_id, b.customer_party_id, a.sales_tran_dt
FROM df_sales_tran a
JOIN df_sales_tran_party_xref b
ON a.sales_transaction_id = b.sales_transaction_id
Limit 3""")
I want to know how I can save the output of this query as a permanent DataFrame table. I noticed that every time I run display(df_joined_sales_partyid), it seems to run the query again. How do I avoid running the query multiple times and save the results to a DataFrame? I am new to writing Scala, so forgive me if this is an easy question, but I couldn't find a solution online.

import org.apache.spark.storage.StorageLevel

// caches the results in memory
df_joined_sales_partyid.cache()
// or keeps them in memory and spills to disk; see
// https://spark.apache.org/docs/2.4.0/api/java/index.html?org/apache/spark/storage/StorageLevel.html
// for other possible storage levels
df_joined_sales_partyid.persist(StorageLevel.MEMORY_AND_DISK)
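Note that cache() and persist() are lazy: nothing is materialized until an action runs. A minimal usage sketch, assuming the df_joined_sales_partyid DataFrame defined above has already been marked with cache() or persist():
// the first action executes the query once and fills the cache
df_joined_sales_partyid.count()
// subsequent actions (show, collect, display, ...) read from the cache instead of
// re-running the join; the cache lasts only for the lifetime of the Spark application,
// so it is not a permanent table
df_joined_sales_partyid.show()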

Related

Hive Data Flow Issues

I am using Hive on an HDInsight/Azure Spark 2.2 cluster, submitting my queries through Ambari; the data is stored in external tables on Azure Data Lake. Staging and target tables are partitioned.
I've been working on loading data into Hive today. The flow of data goes from .gz file -> staging table -> target table. It's an incremental load: a left join from target to landing to preserve old data, then a UNION ALL with the new data for the full set.
I've noticed some behaviors that seem odd to me and was hoping to gather more insight.
Observation 1: After running the script through, I notice the new data from the original table/.gz file is not present in staging or target. I wouldn't expect that, since there's a UNION ALL present.
Observation 2: I did one step manually, loading data into my staging table from the .gz file/table. I run a simple count(*) on it; it returns 39k, great. I try running a select * where val = XYZ; it returns records, great again. I put a count(*) on that same expression and it starts returning 0 records.
Apologies if my thoughts are jumbled, but I wanted to know if there's anybody out there who's experienced similar occurrences and how they overcame them. Let me know if any clarifications are needed.
Are you sure you don't have spaces in your key? Have you tried trim(val)?
Observation 2 is really surprising: with the same WHERE predicates, you have rows being returned with a select * but nothing with a count(*)?
Could you include the SQL queries and some rows of data?

Stream Analytics and query and output

I am learning Azure Stream Analytics and have run into some issues that I cannot handle, so I would like some help.
I wrote a query in Stream Analytics and the query passes the test (I upload a JSON file, run the test, and the results are shown on the page).
Then I run the Stream Analytics job (output to a table), with input coming from Event Hub; however, nothing is output into the table.
I then made several tests, and so far my conclusion is:
When I write the query like this:
SELECT * INTO [TABLE] FROM INTPUT
The results show up in the table (all contents are output to the table).
However if I write the query like this:
SELECT ID, VALUE INTO [Table] FROM INTPUT
Nothing is output to the table.
I don't understand the reason and hope someone can answer the question, because my original query to handle the data is more complex (with a join).

How to use Vertica's COPY LOCAL as an sql statement from MATLAB on Windows

I'm trying to insert around 80 million records created using MATLAB into a Vertica database table. I wanted to know if we can call a COPY LOCAL statement in MATLAB as a regular SQL statement using exec(conn, sql). For test purposes, I tried with a .dat file having around 4 million records, as follows:
sqlstmnt = 'COPY schema.table_name (FK_CUSTOMER_ID,FK_RUN_START_DATE_ID,FK_RUN_END_DATE_ID,FK_TRAVEL_ID,FK_ORIGIN_ID,FK_DEST_ID,FK_SEGMENT_ID,SEGMENT_PERCENTAGE,LAST_UPDATED) FROM LOCAL ''/my/file/full/path/test1.dat''';
results = exec(conn,sqlstmnt);
But it gave an error in results.Message like:
[Vertica]JDBC A ResultSet was expected but not generated from query "COPY schema.table_name(FK_CUSTOMER_ID,FK_RUN_START_DATE_ID,FK_RUN_END_DATE_ID,FK_TRAVEL_ID,FK_ORIGIN_ID,FK_DEST_ID,FK_SEGMENT_ID,SEGMENT_PERCENTAGE,LAST_UPDATED) FROM LOCAL '/my/file/full/path/test1.dat'". Query not executed.
I have the data in the '.dat' file in the order in which the columns are mentioned in COPY LOCAL.
I could not find any helpful resource explaining this error.
I have this test1.dat file which I'm able to insert using COPY from vsql, but since I run my code in MATLAB with many iterations, each iteration producing about a million records, I would want to insert them during each iteration. Any help would be really great.
The COPY command returns a ResultSet that includes the amount of loaded data. I see two main options:
1) results = exec(conn,sqlstmnt);
2) results = runsqlscript(conn,'nameOfSQLScriptthatIncludeTheCopyCommand.sql')
I hope you will find it useful.
Thanks
I just finished reviewing your input sample data.
I see a major problem with the mapping of the input CSV to the target table.
The main issues are:
1) Lines are broken into two lines (you should have one record per line and avoid breaking it into two lines),
e.g.: "1,20150101,0,2,2573,2714,1,8.147237e-01
50,48,49,54,45,48,51,-28 12:11:46"
2) When you define data types on the Vertica table (e.g. timestamp), the data in the CSV must match them (what you have is "-28 12:11:46", which will not work).
After you fix all these issues, make sure you test it using vsql, then try it with MATLAB.
I hope you will find it useful.

Spark HiveContext does not retrieve newly inserted records from Hive Table

I am using Spark 1.4 with a HiveContext to connect to Hive. I did the following:
val hx = new HiveContext(sc)
import hx.implicits._
hx.sql("select * from tab").show
// it is fine, result was shown as expected
Then I inserted a few records into tab from the Beeline console and ran:
hx.refreshTable("tab")
hx.sql("select * from tab").show
// still old records, no newly inserted records
My question is: why didn't the HiveContext retrieve the newly inserted records?
hiveContext.refreshTable(tableName: String) - this will refresh only the metadata of the table (not the actual data).
Notes from the official documentation (credits: https://spark.apache.org):
refreshTable(tableName: String): Unit
Invalidate and refresh all the cached metadata of the given table. For performance reasons, Spark SQL or the external data source library it uses might cache certain metadata about a table, such as the location of blocks. When those change outside of Spark SQL, users should call this function to invalidate the cache.
To retrieve newly inserted records: uncache first and cache again using uncacheTable(String tableName) and cacheTable(String tableName), as in the sketch below.
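A minimal sketch of that sequence, assuming the same hx HiveContext from the question and that tab had previously been cached:
// refresh only the table's metadata (block locations, partitions, ...)
hx.refreshTable("tab")
// drop the cached data for the table, then cache it again
hx.uncacheTable("tab")
hx.cacheTable("tab")
// the next query re-reads the table and should now include the newly inserted rows
hx.sql("select * from tab").show()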
If the target table is partitioned, you need to insert with the 'partition' option. If you leave out the partition, the data will not be visible.
INSERT OVERWRITE TABLE tablename1 PARTITION (partcol1=val1, partcol2=val2...) SELECT col1,col2,.... FROM tablename2
In a slightly different case, I had an RDD coming from a Spark SQL statement via HiveContext. The solution which worked for me after some experimenting was to actually regenerate the RDD itself.
It does not matter whether you are using the DDL from Spark SQL or sending SQL statements directly via hiveContext.sql.
I have seen people use a "count trick" to force recomputation of a dataset, but at least in my attempts I couldn't get to see the new data this way.
Anyway, caching, refreshing and friends did not work for me; if somebody has a proper pattern here, please share.
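For what it's worth, a small sketch of the "regenerate" workaround described above, assuming the hx HiveContext from the question (the variable name is just illustrative):
// re-running the statement builds a fresh DataFrame/RDD from the current table contents
val freshDf = hx.sql("select * from tab")
// reflects rows inserted after the original DataFrame was created
freshDf.show()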

Hive: create table and write it locally at the same time

Is it possible in Hive to create a table and have it saved locally at the same time?
When I get data for my analyses, I usually create temporary tables to track down possible
mistakes in the queries/scripts. Some of these are just temporary tables, while others contain the data that I actually need for my analyses.
What I usually do is use hive -e "select * from db.table" > filename.tsv to get the data locally; however, when the tables are big this can take quite some time.
I was wondering if there is some way in my script to create the table and save it locally at the same time. Probably this is not possible, but I thought it was worth asking.
Honestly, doing it the way you are is the better of the two options, but it is worth noting that you can perform a similar task in an .hql file for automation.
Using syntax like this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user/temp' select * from table;
You can run a query and store the result somewhere in the local directory (as long as there is enough space and the correct privileges).
A disadvantage to this is that with a pipe you get the data stored nicely with '|' delimiters and newline separated, but this method will store the values using Hive's default delimiter ('^A', I believe).
A workaround is to do something like this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select books from table;
But this is only available in Hive 0.11 or higher.