I am trying to read data from Hive into PySpark in order to write CSV files. The following SQL query returns 5 distinct months:
select distinct posting_date from my_table
When I read the data with PySpark, I only get 4 months:
sql_query = 'select * from my_table'
data = spark_session.sql(sql_query)
data.groupBy("posting_date").count().orderBy("posting_date").show()
I had the same problem in the past, and I solved it by using the deprecated API for running SQL:
from pyspark.sql import SQLContext

sql_context = SQLContext(spark_session.sparkContext)
data = sql_context.sql(sql_query)
data.groupBy("posting_date").count().orderBy("posting_date").show()
The problem is that I have the same issue in my current project, and this time I cannot solve it with any method.
I also tried using HiveContext instead of SQLContext, but had no luck.
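Not from the original question, but one thing worth ruling out (an assumption on my part, not something the poster reports trying) is stale table metadata: Spark caches file listings, and a Hive table can have partitions on disk that the metastore does not know about. A minimal sketch, running each statement via spark_session.sql(...):

-- Sketch only: assumes the missing month is hidden by cached or unregistered metadata.
REFRESH TABLE my_table;      -- drop Spark's cached file listing for the table
MSCK REPAIR TABLE my_table;  -- only if the table is partitioned: register partitions present on disk but missing from the metastore
SELECT DISTINCT posting_date FROM my_table;

If the fifth month appears after this, the discrepancy was metadata caching rather than which API issued the query.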
I have a SQL query which I run in Azure Synapse Analytics to query data from ADLS.
Can I run the same query in a notebook using PySpark in Azure Synapse Analytics?
I googled some ways to run SQL in a notebook, but it looks like the code needs some modifications to use one of these:
%%sql or spark.sql("")
Query:
SELECT *
FROM OPENROWSET(
    BULK 'https://xxx.xxx.xxx.xxx.net/datazone/Test/parquet/test.snappy.parquet',
    FORMAT = 'PARQUET'
)
Read the data lake file into a DataFrame, write it to a table with saveAsTable, and query the table as shown below.
df = spark.read.load('abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<filename>', format='parquet')
df.write.mode("overwrite").saveAsTable("testdb.test2")
Using %%sql
%%sql
select * from testdb.test2
Using %%pyspark
%%pyspark
df = spark.sql("select * from testdb.test2")
display(df)
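If you don't need a persistent table, a temporary view over the parquet file works too; a sketch in the same %%sql style, using the same placeholder path (the view name test_view is illustrative):

%%sql
CREATE OR REPLACE TEMPORARY VIEW test_view
USING parquet
OPTIONS (path 'abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<filename>')

%%sql
SELECT * FROM test_view

The view lives only for the Spark session, so nothing is written to the metastore or to storage.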
We're trying to connect Pentaho BI to ClickHouse, and sometimes Pentaho generates queries like the following:
select
...
from
date_dimension_table,
fact_table,
other_dimension_table
where
fact_table.fact_date = date_dimension_table.date
and date_dimension_table.calendar_year = 2019
and date_dimension_table.month_name in ('April', 'June', ...)
and fact_table.other_dimension_id = other_dimension_table.id
and other_dimension_table.code in ('code1', 'code2', ...)
group by
date_dimension_table.calendar_year,
date_dimension_table.month_name,
other_dimension_table.code;
It produces ClickHouse error: Code: 403, e.displayText() = DB::Exception: Invalid expression for JOIN ON. Expected equals expression, got (code AS c2) IN ('code1', 'code2', ...). Supported syntax: JOIN ON Expr([table.]column, ...) = Expr([table.]column, ...) [AND Expr([table.]column, ...) = Expr([table.]column, ...)...] (version 19.15.3.6 (official build))
Engines used for tables: fact_table - MergeTree, both dimensions - TinyLog.
So, my questions:
Can this problem be solved by changing table engines? Unfortunately, we can't change the query; it's autogenerated.
If not, are there any plans to support joins with an IN clause in ClickHouse in the near future?
Thanks.
This issue has been fixed beginning with ClickHouse release v20.3.2.1, 2020-03-12 (see Issue 7314), so you need to upgrade CH.
Note: don't forget to check all backward-incompatible changes (see the changelog).
Let's reproduce this problem on CH 19.15.3 revision 54426 to get the error you described:
Received exception from server (version 19.15.3):
Code: 403. DB::Exception: Received from localhost:9000. DB::Exception: Invalid expression for JOIN ON. Expected equals expression, got code IN ('code1', 'code2'). Supported syntax: JOIN ON Expr([table.]column, ...) = Expr([table.]column, ...) [AND Expr([table.]column, ...) = Expr([table.]column, ...) ...].
Now execute this query on the latest version of CH (20.3.7 revision 54433) to make sure that it works correctly:
docker pull yandex/clickhouse-server:latest
docker run -d --name ch_test_latest yandex/clickhouse-server:latest
docker exec -it ch_test_latest clickhouse-client
# create tables as described below
..
# execute test query
..
Test preparation:
create table date_dimension_table (
date DateTime,
calendar_year Int32,
month_name String
) Engine = Memory;
create table fact_table (
fact_date DateTime,
other_dimension_id Int32
) Engine = Memory;
create table other_dimension_table (
id Int32,
code String
) Engine = Memory;
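The preparation above creates no rows, so the test query would come back empty; a few made-up rows (values invented here purely for illustration) make the repro visible:

INSERT INTO date_dimension_table VALUES ('2019-04-01 00:00:00', 2019, 'April');
INSERT INTO fact_table VALUES ('2019-04-01 00:00:00', 1);
INSERT INTO other_dimension_table VALUES (1, 'code1');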
Test query:
SELECT
date_dimension_table.calendar_year,
date_dimension_table.month_name,
other_dimension_table.code
FROM date_dimension_table
,fact_table
,other_dimension_table
WHERE (fact_table.fact_date = date_dimension_table.date)
AND (date_dimension_table.calendar_year = 2019)
AND (date_dimension_table.month_name IN ('April', 'June'))
AND (fact_table.other_dimension_id = other_dimension_table.id)
AND (other_dimension_table.code IN ('code1', 'code2'))
GROUP BY
date_dimension_table.calendar_year,
date_dimension_table.month_name,
other_dimension_table.code
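As a side note: if upgrading really were impossible, rewriting the statement with explicit equality JOINs, leaving the IN filters in WHERE so they are not folded into JOIN ON by the comma-join rewrite, should sidestep the error on 19.15 as well. A sketch only, since the question says the generated query cannot be changed:

SELECT
    date_dimension_table.calendar_year,
    date_dimension_table.month_name,
    other_dimension_table.code
FROM fact_table
INNER JOIN date_dimension_table ON fact_table.fact_date = date_dimension_table.date
INNER JOIN other_dimension_table ON fact_table.other_dimension_id = other_dimension_table.id
WHERE (date_dimension_table.calendar_year = 2019)
  AND (date_dimension_table.month_name IN ('April', 'June'))
  AND (other_dimension_table.code IN ('code1', 'code2'))
GROUP BY
    date_dimension_table.calendar_year,
    date_dimension_table.month_name,
    other_dimension_table.code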
I am trying to run a query, but Google Cloud gives a syntax error.
I copied this code, which was written in 2017.
I have no idea about SQL.
Syntax error: Unexpected "[" at [5:6]. If this is a table identifier, escape the name with `, e.g. `table.name` rather than [table.name].
The query is:
SELECT
f.repo_name,
f.path,
c.pkey
FROM
[bigquery-public-data:github_repos.files] f
JOIN (
SELECT
id,
You are probably using Standard SQL -- which is a good thing.
Try writing the table reference as:
FROM `bigquery-public-data.github_repos.files` f
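For background: the bracketed [bigquery-public-data:github_repos.files] form is BigQuery's legacy SQL syntax. You can also pin the dialect explicitly with a directive comment on the first line of the query; a small sketch of both forms (run each separately; the LIMIT is added here just to keep the scan small):

#legacySQL
SELECT f.repo_name FROM [bigquery-public-data:github_repos.files] f LIMIT 10

#standardSQL
SELECT f.repo_name FROM `bigquery-public-data.github_repos.files` f LIMIT 10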
Today I wrote this bit of SQL:
SELECT COUNT(T0021_werk_naam)
FROM (SELECT Distinct T0021_werk_naam,T0021_jaar,T0021_kwartiel
FROM T0021_offertes
WHERE T0021_status_code = 'G' AND T0021_jaar = 2013 AND (T0021_kwartiel = 3))
This SQL runs fine when I run it locally in Access. However, when I run it through the code that has been used for ages for this (and most definitely is not the problem) and send it to SQL Server Express, it gives an error saying there's a problem near ')'.
After stripping away all the parentheses I could, it's clear that it objects to the last ')', but I don't see the problem.
Any ideas?
You need to give the SELECT in the parentheses an alias:
SELECT COUNT(T0021_werk_naam)
FROM (
SELECT Distinct T0021_werk_naam,
T0021_jaar,
T0021_kwartiel
FROM T0021_offertes
WHERE T0021_status_code = 'G'
AND T0021_jaar = 2013
AND (T0021_kwartiel = 3)
) T
Notice the T at the end, after the last closing parenthesis. SQL Server requires an alias for every derived table, while Access does not, which is why the query worked locally in Access.