Azure Databricks - reading tables with koalas - sql

I am quite new to Databricks, and I am trying to do some basic data exploration with koalas.
When I log into Databricks, under DATA I see 2 main tabs, DATABASE TABLES and DBFS. I managed to read CSV files as koalas dataframes (e.g. ksdf = ks.read_csv('/FileStore/tables/countryCodes.csv')), but I do not know how I could read the tables I see under DATABASE TABLES as koalas dataframes. None of those tables have filename extensions; I guess they are SQL tables?
Sorry if my question is too basic, and thanks very much for your help.

You just need to use the read_table function, as pointed out in the documentation:
ksdf = ks.read_table('my_database.my_table')
P.S. It's part of the so-called Metastore API.
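As a minimal sketch, assuming a database named my_database with a table my_table registered in the metastore (both names are placeholders for whatever appears under DATABASE TABLES):

import databricks.koalas as ks

# Read a metastore table directly into a koalas DataFrame.
# 'my_database.my_table' is a placeholder name.
ksdf = ks.read_table('my_database.my_table')

# From here it behaves like any other koalas DataFrame.
print(ksdf.head())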

Related

BigQuery Table - Underlying Data's filepath

I am trying to find the file path for data that is inserted from a CSV stored in GCS into a BigQuery table. In Databricks, a DESCRIBE DETAIL/EXTENDED would do the trick, but I can't find the equivalent command in BigQuery.
I have searched the BigQuery docs and tried the INFORMATION_SCHEMA.TABLES route, but it doesn't provide the information I am looking for.
Any insights are greatly appreciated.
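For reference, the Databricks command mentioned above can be run from a PySpark notebook roughly as follows (the table name is a placeholder, and spark is the notebook's SparkSession); for Delta tables it returns a location column, which is the kind of path information the question is asking BigQuery to provide:

# 'my_table' is a placeholder; DESCRIBE DETAIL reports the underlying storage
# location for Delta tables in Databricks.
detail = spark.sql("DESCRIBE DETAIL my_table")
detail.select("location").show(truncate=False)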

Spark based analysis of MSSQL imported file

We have a small Spark cluster, version 3.x. We have to analyze a database dump file obtained from an MS SQL export. Its format is .sql: it contains first the schema of each table and then the INSERT statements. The schema is entirely in MS SQL format.
I have searched a lot and found some connectors to MS SQL Server, but could not find anything related to analyzing an MSSQL dump file. What could be a possible approach for this purpose? I am using the PySpark API.
Some references that I found in this regard:
https://learn.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver15
https://learn.microsoft.com/en-us/sql/big-data-cluster/spark-mssql-connector?view=sql-server-ver15

aws Glue: Is it possible to pull only specific data from a database?

I need to transform a fairly big database table to CSV with AWS Glue. However, I only need the newest table rows from the past 24 hours. There is a column which specifies the creation date of each row. Is it possible to transform just those rows, without copying the whole table into the CSV file? I am using a Python script with Spark.
Thank you very much in advance!
There are some Built-in Transforms in AWS Glue which are used to process your data. These transforms can be called from ETL scripts.
Please refer to the link below:
https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html
You haven't mentioned the type of database you are trying to connect to. Anyway, for JDBC connections Spark has a query option, in which you can issue the usual SQL query to get only the rows you need.
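As a rough sketch of that query option, assuming a JDBC-reachable database and a created_at timestamp column (the connection details, table name and column name below are all placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Push the 24-hour filter down to the database instead of copying the whole table.
# URL, credentials and SQL dialect are placeholders; adjust for your database.
recent_rows = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://my-host:5432/my_db")
    .option("query", "SELECT * FROM my_table WHERE created_at >= NOW() - INTERVAL '24 hours'")
    .option("user", "my_user")
    .option("password", "my_password")
    .load()
)

# Write only the filtered rows out as CSV (placeholder output path).
recent_rows.write.csv("s3://my-bucket/exports/recent_rows", header=True)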

What is the easiest way to query a CSV file in Oracle SQL Developer?

I have a fairly simple CSV file that I would like to use within a SQL query. I'm using Oracle SQL Developer, but none of the solutions I have found on the web so far seem to have worked. I don't need to store the data (unless I can use temp tables?), just to query it and show the results.
Thank You!
You need to create an EXTERNAL TABLE. This essentially maps a CSV (or indeed any flat file) to a table. You can then use that table in queries. You will not be able to perform DML on the external table.

Incrementally add data to Parquet tables in S3

I would like to keep a copy of my log data in Parquet on S3 for ad hoc analytics. I mainly work with Parquet through Spark, and that only seems to offer operations to read and write whole tables via SQLContext.parquetFile() and SQLContext.saveAsParquetFile().
Is there any way to add data to an existing Parquet table without writing a whole new copy of it, particularly when it is stored in S3?
I know I can create separate tables for the updates, and in Spark I can form the union of the corresponding DataFrames at query time, but I have my doubts about the scalability of that.
I can use something other than Spark if needed.
The way to append to an existing Parquet dataset is to use SaveMode.Append:
`yourDataFrame.write.mode(SaveMode.Append).parquet("/your/file")`
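That snippet is Scala; a minimal PySpark sketch of the same append, with placeholder data and a placeholder S3 path, would be:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# `new_logs` is just an illustrative DataFrame standing in for the new records.
new_logs = spark.createDataFrame([(1, "log line")], ["id", "message"])

# Append to the existing Parquet dataset instead of rewriting it.
new_logs.write.mode("append").parquet("s3a://my-bucket/logs/parquet")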
You don't need to union DataFrames after creating them separately; just supply all the paths relevant to your query to parquetFile(paths) and get one DataFrame, just as the signature of the parquet reader, sqlContext.parquetFile(paths: String*), suggests.
Under the hood, in newParquetRelation2, all the .parquet files from all the folders you supply, as well as all the _common_metadata and _metadata files, are collected into a single list and treated equally.
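The answer above uses the older SQLContext API; with the current DataFrameReader the same multi-path read looks roughly like this (all paths are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the base folder plus a separately written update folder in one call;
# Spark merges them into a single DataFrame, so no explicit union is needed.
combined = spark.read.parquet(
    "s3a://my-bucket/logs/parquet/base",
    "s3a://my-bucket/logs/parquet/updates",
)
print(combined.count())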