I have a Spark Thrift Server running on Apache Spark 3.1.2, where I created a table and inserted values using Beeline. It looks like this:
0: jdbc:hive2://localhost:10000/> select * from mydb4.test;
+-------+--------+
|  key  | value  |
+-------+--------+
| 1235  | test4  |
| 123   | test   |
+-------+--------+
However, when I try to fetch this table using PySpark, every returned row just repeats the column names:
database = "mydb4"
table = "test"
jdbcDF = spark.read.format("jdbc") \
    .option("url", f"jdbc:hive2://<URL>/{database}") \
    .option("dbtable", table) \
    .load()
jdbcDF.show()
+---+-----+
|key|value|
+---+-----+
|key|value|
|key|value|
+---+-----+
Why can't I see the proper values in the returned table? I am only seeing the column names instead of values.
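This looks like the known mismatch between Spark's generic JDBC source and the Hive JDBC driver: Spark quotes the projected column names, and the driver hands those quoted identifiers back as string literals, so every row echoes the column names. I can't confirm that is your exact setup, but if your PySpark session can reach the same Hive metastore as the Thrift server, one workaround is to skip JDBC and read the table directly. A minimal sketch under that shared-metastore assumption (the app name is a placeholder):

from pyspark.sql import SparkSession

# Read through the shared Hive metastore instead of the hive2 JDBC driver.
spark = SparkSession.builder \
    .appName("read-thrift-table") \
    .enableHiveSupport() \
    .getOrCreate()

spark.table("mydb4.test").select("key").show()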
I have a CSV file with quotes in the column values. How can I remove those quotes from the column values? For example:
+--------+------+------+
|sample |id |status|
+--------+------+------+
|00000001|'1111'|'yes' |
|00000002|'1222'|'no' |
|00000003|'1333'|'yes' |
+--------+------+------+
When I read it, I should get a DataFrame like the one below, without the single quotes:
+--------+------+------+
|sample |id |status|
+--------+------+------+
|00000001| 1111 | yes |
|00000002| 1222 | no |
|00000003| 1333 | yes |
+--------+------+------+
While loading the CSV data, you can specify the options below and Spark will automatically parse away the quotes.
Check the code below.
df = spark.read \
    .option("header", "true") \
    .option("quote", "'") \
    .option("escape", "'") \
    .csv("<path to directory>")
When I type \l into psql, I get:
                                            List of databases
   Name    |  Owner   | Encoding |          Collate           |           Ctype            |   Access privileges
-----------+----------+----------+----------------------------+----------------------------+-----------------------
 postgres  | postgres | UTF8     | English_United States.1251 | English_United States.1251 |
 template0 | postgres | UTF8     | English_United States.1251 | English_United States.1251 | =c/postgres          +
           |          |          |                            |                            | postgres=CTc/postgres
 template1 | postgres | UTF8     | English_United States.1251 | English_United States.1251 | =c/postgres          +
           |          |          |                            |                            | postgres=CTc/postgres
So here I have one database named postgres, but if I type \d I get:
        List of relations
 Schema | Name | Type  |  Owner
--------+------+-------+----------
 public | db1  | table | postgres
(1 row)
In pgAdmin I can see 1 database named "postgres", so why does \d tell me about a db1 database? (I created it earlier and dropped it.)
From the psql help:
Informational
(options: S = show system objects, + = additional detail)
\d[S+] list tables, views, and sequences
And as your output shows, db1 is a table, not a database...
DROP TABLE db1; will get rid of it.
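If you want to check the same thing programmatically, the two meta-commands correspond roughly to catalog queries. A sketch with psycopg2, where the connection parameters are placeholders for your local setup:

import psycopg2

conn = psycopg2.connect(dbname="postgres", user="postgres")
cur = conn.cursor()

# Roughly what \l shows: databases in the cluster.
cur.execute("SELECT datname FROM pg_database WHERE NOT datistemplate;")
print("databases:", [r[0] for r in cur.fetchall()])

# Roughly what \d shows: relations inside the *current* database.
cur.execute("SELECT schemaname, tablename FROM pg_tables WHERE schemaname = 'public';")
print("tables:", cur.fetchall())

conn.close()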
I am trying to restore an SQL dump that looks like this:
COPY table_name (id, oauth_id, foo, bar) FROM stdin;
1 142 \N xxxxxxx
2 142 \N yyyyyyy
<dozen similar lines>
The last line in this dump is \.
The command to restore:
psql < table.sql
or
psql --file=dump.sql
\d+ table_name:
Table "public.table_name"
Column | Type | Modifiers | Storage | Stats target | Description
---------------------+-----------------------+-------------------------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('connected_table_name_id_seq'::regclass) | plain | |
oauth_id | integer | not null | plain | |
foo | character varying | | extended | |
bar | character varying | | extended | |
Sadly, it looks like the standard method for backup and restore does not work :(
psql version: 9.5.4, server version: 9.5.2
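One thing worth checking is which database psql connects to: without -d (or PGDATABASE) it restores into your default database, not necessarily the one you inspect afterwards. For what it's worth, the COPY ... FROM stdin block is plain tab-separated data with \N for NULL, and the same mechanism can be driven from psycopg2; a minimal sketch using the columns from the dump (connection details and sample rows are assumptions):

import io
import psycopg2

conn = psycopg2.connect(dbname="postgres")  # placeholder connection
cur = conn.cursor()

# Two rows in COPY text format: tab-separated, \N means NULL.
data = io.StringIO("1\t142\t\\N\txxxxxxx\n2\t142\t\\N\tyyyyyyy\n")

cur.copy_expert("COPY table_name (id, oauth_id, foo, bar) FROM STDIN", data)
conn.commit()
conn.close()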
I need a query to find the column names of a table (table metadata) in BigQuery, like the following query in Oracle SQL:
SELECT column_name,data_type,data_length,data_precision,nullable FROM all_tab_cols where table_name ='EMP';
BigQuery now supports INFORMATION_SCHEMA.
Suppose you have a dataset MY_DATASET in project MY_PROJECT and a table named MY_TABLE; then you can run the following query:
SELECT column_name
FROM MY_PROJECT.MY_DATASET.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'MY_TABLE'
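If you need the same list from code rather than the console, here is a small sketch with the google-cloud-bigquery Python client, reusing the placeholder names above:

from google.cloud import bigquery

client = bigquery.Client(project="MY_PROJECT")

query = """
SELECT column_name, data_type, is_nullable
FROM `MY_PROJECT.MY_DATASET.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'MY_TABLE'
"""

# Run the query and print one line per column of the table.
for row in client.query(query).result():
    print(row.column_name, row.data_type, row.is_nullable)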
Yes, you can get table metadata using INFORMATION_SCHEMA.
One of the examples in the linked documentation retrieves metadata from the INFORMATION_SCHEMA.COLUMN_FIELD_PATHS view for the commits table in the github_repos dataset; you just have to:
Open the BigQuery web UI in the GCP Console.
Enter the following standard SQL query in the Query editor box. INFORMATION_SCHEMA requires standard SQL syntax. Standard SQL is the default syntax in the GCP Console.
SELECT
  *
FROM
  `bigquery-public-data`.github_repos.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE
  table_name = "commits"
  AND (column_name = "author" OR column_name = "difference")
Note: INFORMATION_SCHEMA view names are case-sensitive.
Click Run.
The results should look like the following:
+------------+-------------+---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| table_name | column_name | field_path | data_type | description |
+------------+-------------+---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| commits | author | author | STRUCT<name STRING, email STRING, time_sec INT64, tz_offset INT64, date TIMESTAMP> | NULL |
| commits | author | author.name | STRING | NULL |
| commits | author | author.email | STRING | NULL |
| commits | author | author.time_sec | INT64 | NULL |
| commits | author | author.tz_offset | INT64 | NULL |
| commits | author | author.date | TIMESTAMP | NULL |
| commits | difference | difference | ARRAY<STRUCT<old_mode INT64, new_mode INT64, old_path STRING, new_path STRING, old_sha1 STRING, new_sha1 STRING, old_repo STRING, new_repo STRING>> | NULL |
| commits | difference | difference.old_mode | INT64 | NULL |
| commits | difference | difference.new_mode | INT64 | NULL |
| commits | difference | difference.old_path | STRING | NULL |
| commits | difference | difference.new_path | STRING | NULL |
| commits | difference | difference.old_sha1 | STRING | NULL |
| commits | difference | difference.new_sha1 | STRING | NULL |
| commits | difference | difference.old_repo | STRING | NULL |
| commits | difference | difference.new_repo | STRING | NULL |
+------------+-------------+---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
For newbies like me, the above follows this general syntax:
select * from `project_name.dataset_name`.INFORMATION_SCHEMA.COLUMNS where table_catalog = 'project_name' and table_schema = 'dataset_name' and table_name = 'your_table_name'
Update: This is now possible! See the INFORMATION_SCHEMA docs and the answers below.
Answer, circa 2012:
It's not currently possible to retrieve table metadata (i.e. column names and types) via a query, though this isn't the first time it's been requested.
Is there a reason you need to do this as a query? Table metadata is available via the tables API.
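For reference, fetching the schema via the tables API looks roughly like this with the current Python client; the table ID is a placeholder:

from google.cloud import bigquery

client = bigquery.Client()

# get_table() calls the tables API, so no query job is needed.
table = client.get_table("my_project.my_dataset.my_table")

for field in table.schema:
    print(field.name, field.field_type, field.mode)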
Actually, it is possible to get table metadata using SQL: you need to query the logging table for the last log of this particular table being created.
For example, assuming the table is loaded/created daily:
CREATE TEMP FUNCTION jsonSchemaStringToArray(jsonSchema String)
RETURNS ARRAY<STRING> AS ((
SELECT
SPLIT(
REGEXP_REPLACE(REPLACE(LTRIM(jsonSchema,'{ '),'"fields": [',''), r'{[^{]+"name": "([^\"]+)"[^}]+}[, ]*', '\\1,')
,',')
));
WITH valid_schema_columns AS (
WITH array_output AS (SELECT
jsonSchemaStringToArray(jsonSchema) AS column_names
FROM (
SELECT
protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.schemaJson AS jsonSchema
, ROW_NUMBER() OVER (ORDER BY metadata.timestamp DESC) AS record_count
FROM `realself-main.bigquery_logging.cloudaudit_googleapis_com_data_access_20170101`
WHERE
protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.destinationTable.tableId = '<table_name>'
AND
protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.destinationTable.datasetId = '<schema_name>'
AND
protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.createDisposition = 'CREATE_IF_NEEDED'
) AS t
WHERE
t.record_count = 1 -- grab the latest entry
)
-- this is actually what UNNESTS the array into standard rows
SELECT
valid_column_name
FROM array_output
LEFT JOIN UNNEST(column_names) AS valid_column_name
)
SELECT * FROM valid_schema_columns
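Note that this approach only works if your project exports Cloud Audit Logs to a BigQuery dataset (the bigquery_logging dataset above), and it only sees tables created by load jobs that were captured in those logs.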
To check a column, you can also query the table through the bq CLI, which is easy and simple:
bq query --use_legacy_sql=false 'SELECT Hour, SUM(column1) AS column1_total FROM `project_id.dataset.table_name` WHERE DATE(Hour) = "2020-06-10"'