Getting column names in the rows when querying a Thrift database from pyspark-sql

I have a Thrift server running on Apache Spark 3.1.2, where I have created a table and inserted values using Beeline. It looks like this:
0: jdbc:hive2://localhost:10000/> select * from mydb4.test;
+-------+--------+
| key | value |
+-------+--------+
| 1235 | test4 |
| 123 | test |
+-------+--------+
However, when I try to fetch this table using pyspark, the returned rows contain the column names instead of the values:
database = "mydb4"
table = "test"
jdbcDF = spark.read.format("jdbc") \
    .option("url", f"jdbc:hive2://<URL>/{database}") \
    .option("dbtable", table) \
    .load()
jdbcDF.show()
+---+-----+
|key|value|
+---+-----+
|key|value|
|key|value|
+---+-----+
Why can't I see the proper values in the returned table? I am only seeing the column names repeated instead of the actual values.

Related

How to remove quotes in the column value pyspark

I have a CSV file with quotes in the column values. How can I remove those quotes from the column values? For example:
+--------+------+------+
|sample |id |status|
+--------+------+------+
|00000001|'1111'|'yes' |
|00000002|'1222'|'no' |
|00000003|'1333'|'yes' |
+--------+------+------+
When I read it, I should get a DataFrame like the one below, without the single quotes:
+--------+------+------+
|sample |id |status|
+--------+------+------+
|00000001| 1111 | yes |
|00000002| 1222 | no |
|00000003| 1333 | yes |
+--------+------+------+
While loading the CSV data, you can specify the options below, and Spark will automatically parse the quotes.
Check the code below.
spark.read \
    .option("quote", "'") \
    .option("escape", "'") \
    .csv("<path to directory>")

Query to show all column, table and schema names together in IMPALA

I want to get the metadata of an Impala database in one query. It would probably look something like:
SELECT columnname,tablename,schemaname from SYSTEM.INFO
Is there a way to do that? I don't want to fetch only the current table's columns, as in, for example:
SHOW COLUMN STATS db.table_name
That query does not answer my question; I want to select all the metadata in one query.
From impala-shell you have commands like:
describe table_name
describe formatted table_name
describe database_name
EXPLAIN { select_query | ctas_stmt | insert_stmt }
and the SHOW statement, which is a flexible way to get information about different types of Impala objects; see the Impala documentation on the SHOW statement.
On the other hand, information about the schema objects is held in the metastore database. This database is shared between Impala and Hive.
In particular, Impala keeps its table definitions in a traditional MySQL or PostgreSQL database known as the metastore, the same database where Hive keeps this type of data. Thus, Impala can access tables defined or loaded by Hive, as long as all columns use Impala-supported data types, file formats, and compression codecs.
If you want to query this information in one shot, you have to query the metastore database directly in MySQL, PostgreSQL, Oracle, etc., depending on your particular setup.
For example, in my case Impala keeps its metadata in MySQL:
use metastore;
-- Database changed
SHOW tables;
+---------------------------+
| Tables_in_metastore |
+---------------------------+
| BUCKETING_COLS |
| CDS |
| COLUMNS_V2 |
| COMPACTION_QUEUE |
| COMPLETED_TXN_COMPONENTS |
| DATABASE_PARAMS |
| DBS |
.......
........
| TAB_COL_STATS |
| TBLS |
| TBL_COL_PRIVS |
| TBL_PRIVS |
| TXNS |
| TXN_COMPONENTS |
| TYPES |
| TYPE_FIELDS |
| VERSION |
+---------------------------+
54 rows in set (0.00 sec)
SELECT * FROM VERSION;
+--------+----------------+----------------------------+-------------------+
| VER_ID | SCHEMA_VERSION | VERSION_COMMENT | SCHEMA_VERSION_V2 |
+--------+----------------+----------------------------+-------------------+
| 1 | 1.1.0 | Hive release version 1.1.0 | 1.1.0-cdh5.12.0 |
+--------+----------------+----------------------------+-------------------+
1 row in set (0.00 sec)
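Since the question asks for everything in one query, below is a hedged sketch that joins the standard metastore tables (DBS, TBLS, SDS, COLUMNS_V2) from Python. The host and credentials are placeholders, and the metastore schema can vary between versions, so verify the table names against your installation first:
import mysql.connector  # pip install mysql-connector-python

# Placeholders: point these at your metastore database.
conn = mysql.connector.connect(
    host="metastore-host",
    user="hive",
    password="***",
    database="metastore",
)
cur = conn.cursor()
# One query over the metastore: every column with its table and schema name.
cur.execute("""
    SELECT d.NAME AS schema_name, t.TBL_NAME, c.COLUMN_NAME, c.TYPE_NAME
    FROM DBS d
    JOIN TBLS t ON t.DB_ID = d.DB_ID
    JOIN SDS s ON s.SD_ID = t.SD_ID
    JOIN COLUMNS_V2 c ON c.CD_ID = s.CD_ID
""")
for schema_name, table_name, column_name, type_name in cur.fetchall():
    print(schema_name, table_name, column_name, type_name)
conn.close()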
Hope this helps.

Conditional update column B with modified value based on column A

I am facing a large table with data that was imported from a CSV. However, the delimiters in the CSV were not sanitized, so the input data looked something like this:
alex#mail.com:Alex
dummy#mail.com;Bob
foo#bar.com:Foo
spam#yahoo.com;Spam
whatever#mail.com:Whatever
During the import, : was defined as the delimiter, so each row using ; as the delimiter was not imported properly. This resulted in a table structured like this:
| ID | MAIL | USER |
|-----|---------------------|----------|
| 1 | alex#mail.com | ALEX |
| 2 | dummy#mail.com;Bob | NULL |
| 3 | foo#bar.com | Foo |
| 4 | spam#yahoo.com;Spam | NULL |
| 5 | whatever#mail.com | Whatever |
As reimporting is not an option, I was thinking about manually sanitizing the data in the affected rows with SQL queries. So I tried to combine SELECT and UPDATE statements, filtering rows WHERE USER IS NULL and updating both columns with the correct values where applicable.
What you need are string functions. Reading the documentation, I find that Google BigQuery has STRPOS() and SUBSTR().
https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#substr
https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#strpos
An update query to fix the situation you are describing looks like this:
UPDATE table_name
SET mail = SUBSTR(mail, 1, STRPOS(mail, ';') - 1),
    user = SUBSTR(mail, STRPOS(mail, ';') + 1)
WHERE user IS NULL;
The idea here is to split mail into its two parts: the part before the ; and the part after. Both SET expressions are evaluated against the original row value of mail, so the order of the assignments does not matter. Hope this helps.
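Before running the UPDATE on a large table, the split logic can be sanity-checked in plain Python; the sample rows are hypothetical, and note that STRPOS is 1-based while Python's find() is 0-based:
# Simulates SUBSTR(mail, 1, STRPOS(mail, ';') - 1) and SUBSTR(mail, STRPOS(mail, ';') + 1)
rows = ["dummy#mail.com;Bob", "spam#yahoo.com;Spam"]
for mail in rows:
    pos = mail.find(";")     # 0-based position of the delimiter
    fixed_mail = mail[:pos]  # everything before the ';'
    user = mail[pos + 1:]    # everything after the ';'
    print(fixed_mail, user)
# Output:
# dummy#mail.com Bob
# spam#yahoo.com Spam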

Handle Partition in Hive table while using Sqoop import

I have a question on the Sqoop import utility. I understand we can run a "sqoop import" and get the data from an RDBMS (SQL Server in my case) and put it directly into a Hive table (which will be created dynamically).
My question is how to create partitions in this Hive table if I have to, using the "sqoop import" utility (is it possible?).
After "sqoop import to Hive" is done, I always see a Hive table which is not partitioned. My requirement is to have a table partitioned on columns x, y, z.
Thanks,
Sid
You can import data directly into a Hive table and have Sqoop create and load a partitioned table in one step.
Please find the code below:
sqoop import \
--connect "jdbc:sqlserver://yourservername:1433;databases=EMP" \
--connection-manager org.apache.sqoop.manager.SQLServerManager \
--username youruserid \
--password yourpassword \
--fields-terminated-by '|' \
--as-textfile \
--delete-target-dir \
--target-dir 'hdfspathlocation' \
--hive-import \
--hive-overwrite \
--hive-table UDB.EMPLOYEE_PARTITION_TABLE \
--hive-partition-key EMPLOYEE_CITY \
--hive-partition-value 'NOIDA' \
--num-mappers 1 \
--query "select TEST_EMP_ID,TEST_EMP_NAME,TEST_EMP_DEPARTMENT,TEST_EMP_SALARY,TEST_EMP_CITY FROM EMP.dbo.TEST_EMP_TABLE where TEST_EMP_CITY = 'NOIDA' AND \$CONDITIONS";
As you can see, this Sqoop import will create a partitioned table UDB.EMPLOYEE_PARTITION_TABLE in Hive, with EMPLOYEE_CITY as the partition column.
This will create a managed table in Hive with the data stored in text format.
Below is the schema of the Hive table:
+--------------------------+-----------------------+-----------------------+--+
| col_name | data_type | comment |
+--------------------------+-----------------------+-----------------------+--+
| test_emp_id | int | |
| test_emp_name | string | |
| test_emp_department | string | |
| test_emp_salary | int | |
| test_emp_city | string | |
| employee_city | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| employee_city | string | |
+--------------------------+-----------------------+-----------------------+--+
0 2018-11-30 00:01 /hdfspathlocation/udb.db/employee_partition_table/employee_city=NOIDA
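To confirm that the partition landed, one quick check is to query the table from a Hive-enabled Spark session; a sketch, using the table name from the example above:
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Lists partition directories such as employee_city=NOIDA
spark.sql("SHOW PARTITIONS udb.employee_partition_table").show(truncate=False)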
You need to make sure of a few things.
Your hive-partition-key column name must not be one of the imported columns when you are using hive-import; otherwise you will get the error below:
Imported Failed: Partition key TEST_EMP_CITY cannot be a column to import.
Keep your partition column at the end of your SELECT statement when specifying the query in the Sqoop import:
select TEST_EMP_ID,TEST_EMP_NAME,TEST_EMP_DEPARTMENT,TEST_EMP_SALARY,TEST_EMP_CITY FROM EMP.dbo.TEST_EMP_TABLE where TEST_EMP_CITY = 'NOIDA' AND \$CONDITIONS
Let me know if this works for you.

Bigquery query to find the column names of a table

I need a query to find the column names of a table (table metadata) in BigQuery, like the following query in SQL:
SELECT column_name,data_type,data_length,data_precision,nullable FROM all_tab_cols where table_name ='EMP';
BigQuery now supports INFORMATION_SCHEMA.
Suppose you have a dataset named MY_PROJECT.MY_DATASET and a table named MY_TABLE. Then you can run the following query:
SELECT column_name
FROM MY_PROJECT.MY_DATASET.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'MY_TABLE'
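If you are querying from Python, the same lookup can go through the BigQuery client library; a sketch, where the project, dataset, and table names are placeholders:
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="MY_PROJECT")
sql = """
    SELECT column_name, data_type, is_nullable
    FROM `MY_PROJECT.MY_DATASET.INFORMATION_SCHEMA.COLUMNS`
    WHERE table_name = 'MY_TABLE'
"""
# Each result row exposes the INFORMATION_SCHEMA columns as attributes.
for row in client.query(sql).result():
    print(row.column_name, row.data_type, row.is_nullable)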
Yes, you can get table metadata using INFORMATION_SCHEMA.
One of the examples in the documentation retrieves metadata from the INFORMATION_SCHEMA.COLUMN_FIELD_PATHS view for the commits table in the github_repos dataset. You just have to:
Open the BigQuery web UI in the GCP Console.
Enter the following standard SQL query in the Query editor box. INFORMATION_SCHEMA requires standard SQL syntax, which is the default syntax in the GCP Console.
SELECT
*
FROM
`bigquery-public-data`.github_repos.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE
table_name="commits"
AND column_name="author"
OR column_name="difference"
Note: INFORMATION_SCHEMA view names are case-sensitive.
Click Run.
The results should look like the following:
+------------+-------------+---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| table_name | column_name | field_path | data_type | description |
+------------+-------------+---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| commits | author | author | STRUCT<name STRING, email STRING, time_sec INT64, tz_offset INT64, date TIMESTAMP> | NULL |
| commits | author | author.name | STRING | NULL |
| commits | author | author.email | STRING | NULL |
| commits | author | author.time_sec | INT64 | NULL |
| commits | author | author.tz_offset | INT64 | NULL |
| commits | author | author.date | TIMESTAMP | NULL |
| commits | difference | difference | ARRAY<STRUCT<old_mode INT64, new_mode INT64, old_path STRING, new_path STRING, old_sha1 STRING, new_sha1 STRING, old_repo STRING, new_repo STRING>> | NULL |
| commits | difference | difference.old_mode | INT64 | NULL |
| commits | difference | difference.new_mode | INT64 | NULL |
| commits | difference | difference.old_path | STRING | NULL |
| commits | difference | difference.new_path | STRING | NULL |
| commits | difference | difference.old_sha1 | STRING | NULL |
| commits | difference | difference.new_sha1 | STRING | NULL |
| commits | difference | difference.old_repo | STRING | NULL |
| commits | difference | difference.new_repo | STRING | NULL |
+------------+-------------+---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
For newbies like me, the above follows this general pattern:
SELECT * FROM project_name.dataset_name.INFORMATION_SCHEMA.COLUMNS WHERE table_catalog = 'project_name' AND table_schema = 'dataset_name' AND table_name = 'table_name'
Update: This is now possible! See the INFORMATION_SCHEMA docs and the answers below.
Answer, circa 2012:
It's not currently possible to retrieve table metadata (i.e. column names and types) via a query, though this isn't the first time it's been requested.
Is there a reason you need to do this as a query? Table metadata is available via the tables API.
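For reference, fetching the schema through the tables API from Python looks roughly like this; the table path is a placeholder:
from google.cloud import bigquery

client = bigquery.Client()
# get_table() returns a Table object whose .schema is a list of SchemaField
table = client.get_table("my-project.my_dataset.my_table")
for field in table.schema:
    print(field.name, field.field_type, field.mode)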
Actually, it is possible to do so using SQL. To do so, you need to query the logging table for the last log of this particular table being created.
For example, assuming the table is loaded/created daily:
CREATE TEMP FUNCTION jsonSchemaStringToArray(jsonSchema STRING)
RETURNS ARRAY<STRING> AS ((
  SELECT SPLIT(
    REGEXP_REPLACE(
      REPLACE(LTRIM(jsonSchema, '{ '), '"fields": [', ''),
      r'{[^{]+"name": "([^\"]+)"[^}]+}[, ]*', '\\1,'),
    ',')
));
WITH valid_schema_columns AS (
  WITH array_output AS (
    SELECT jsonSchemaStringToArray(jsonSchema) AS column_names
    FROM (
      SELECT
        protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.schemaJson AS jsonSchema,
        ROW_NUMBER() OVER (ORDER BY metadata.timestamp DESC) AS record_count
      FROM `realself-main.bigquery_logging.cloudaudit_googleapis_com_data_access_20170101`
      WHERE
        protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.destinationTable.tableId = '<table_name>'
        AND protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.destinationTable.datasetId = '<schema_name>'
        AND protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.createDisposition = 'CREATE_IF_NEEDED'
    ) AS t
    WHERE t.record_count = 1  -- grab the latest entry
  )
  -- this is what actually UNNESTs the array into standard rows
  SELECT valid_column_name
  FROM array_output
  LEFT JOIN UNNEST(column_names) AS valid_column_name
)
SELECT * FROM valid_schema_columns
To check columns, you can also access your table through the CLI, which is easy and simple:
bq query --use_legacy_sql=false "select Hour, sum(column1) as column1_total from \`project_id.dataset.table_name\` where Date(Hour) = '2020-06-10';"