I'm trying to aggregate my data's dates on the minimum value. I tried to use groupBy(), but it gives an error.
history = history.selectExpr('aaa', 'bbb', 'ccc', 'date')
history = history.groupBy()('aaa', 'bbb', 'ccc', min('date'))
I first tried
history = history.selectExpr('aaa', 'bbb', 'ccc', min('date'))
but that didn't work either.
Thank you :)
You can simply do this:
from pyspark.sql import functions as F
history.groupBy("AAA","BBB","CCC").agg(F.min("date"))
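For example, a minimal sketch assuming the column names from the question (the min_date alias and the result variable are just illustrations):
from pyspark.sql import functions as F

# Group on the key columns and keep the earliest date in each group.
result = (history
    .groupBy('aaa', 'bbb', 'ccc')
    .agg(F.min('date').alias('min_date')))
result.show()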
My BigQuery query is:
SELECT d.type AS `error_type`, count('d.type') AS `count`
FROM `table_android`, unnest(`table_android`.`exceptions`) AS `d`
WHERE `table_android`.`event_timestamp` BETWEEN '2022-12-15' AND '2022-12-20' GROUP BY `error_type` ORDER BY `count` desc;
This query works fine in the BigQuery editor, but I could not get the same results with the same query written in SQLAlchemy.
SQLAlchemy query:
sa.select(
sa.literal_column("d.type").label("Error_Type"),
sa.func.count("Error_Type").label("count"),
)
.select_from(sa.func.unnest(table_android.c.exceptions).alias("d"))
.group_by("Error_Type")
.order_by(desc("count"))
.where(table_android.c.event_timestamp.between('2022-12-15', '2022-12-20'))
.cte("main_table")
Correct result vs. wrong result: (screenshots omitted)
I am using the python-bigquery-sqlalchemy library. The table_android.exceptions column structure and column types: (screenshots omitted)
And this is the SQL that SQLAlchemy renders:
SELECT `d`.`type` AS `Error_Type`, count(`d`.`type`) AS `count` FROM `table_android`, unnest(`table_android`.`exceptions`) AS `d` WHERE `table_android`.`event_timestamp` BETWEEN '2022-12-05' AND '2022-12-20' GROUP BY `Error_Type` ORDER BY `count` DESC
I see the correct result in the BigQuery editor, but SQLAlchemy does not give the correct result. How should I edit my SQLAlchemy query to get the correct results?
I don't have BigQuery so I can't test this, but I think you want column and not literal_column. Also, I think you can create an implicit CROSS JOIN by including both the table and the unnested column in select_from.
# I'm testing with postgresql but it doesn't have an easy way to make structs
from sqlalchemy import Column, Date, Integer, MetaData, String, Table
from sqlalchemy.dialects import postgresql
from sqlalchemy.sql import column, func, select

metadata = MetaData()
table_android = Table(
    "table_android", metadata,
    Column("id", Integer, primary_key=True),
    Column("exceptions", postgresql.ARRAY(String)),
    Column("event_timestamp", Date))

d = func.unnest(table_android.c.exceptions).alias("d")
# You might be able to do d.column["type"].label("Error_Type")
type_col = column("d.type").label("Error_Type")
count_col = func.count(type_col).label("count")

print(select(
        type_col,
        count_col,
    )
    .select_from(table_android, d)
    .group_by(type_col)
    .order_by(count_col)
    .where(table_android.c.event_timestamp.between('2022-12-15', '2022-12-20'))
    .cte("main_table"))
I'm trying to test a simple SQL query which should do something like this:
import duckdb
import pandas as pd
df_test = pd.read_clipboard()
duckdb.query("SELECT * FROM df_test").df()
That works, but I can't get the following query to work:
select count(df_test) as cnt,
year(a) as yr
from df_test
where d = "outcome1"
group by yr
However, I get this as an exception, presumably from DuckDB.
BinderException: Binder Error: No function matches the given name and argument types 'year(VARCHAR)'. You might need to add explicit type casts.
Candidate functions:
year(TIMESTAMP WITH TIME ZONE) -> BIGINT
year(DATE) -> BIGINT
year(TIMESTAMP) -> BIGINT
year(INTERVAL) -> BIGINT
I was only using pandas as it seemed to be the easiest way to get a CSV file (via pd.read_clipboard) into DuckDB.
Any ideas?
(I'm using a Mac, by the way.)
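A sketch of a version of the query that avoids the binder error, assuming column a holds ISO-formatted date strings and d is a text column (both assumptions, since the CSV contents aren't shown):
import duckdb
import pandas as pd

df_test = pd.read_clipboard()

# count(*) counts the matching rows; cast the string column to DATE before year(),
# and use single quotes for the string literal (double quotes mean identifiers in SQL).
duckdb.query("""
    select count(*) as cnt,
           year(cast(a as date)) as yr
    from df_test
    where d = 'outcome1'
    group by yr
""").df()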
I am reading a CSV file which looks something like this:
"ZEN","123"
"TEN","567"
Now when I replace the character E with regexp_replace, it's not giving correct results:
from pyspark.sql.functions import row_number, col, desc, date_format, to_date, to_timestamp, regexp_replace
from pyspark.sql.types import StructType, StringType

inputDirPath = "/FileStore/tables/test.csv"

schema = StructType()
for field in fields:
    colType = StringType()
    schema.add(field.strip(), colType, True)

incr_df = (spark.read.format("csv")
    .option("header", "false")
    .schema(schema)
    .option("delimiter", "\u002c")
    .option("nullValue", "")
    .option("emptyValue", "")
    .option("multiline", True)
    .csv(inputDirPath))

for column in incr_df.columns:
    inc_new = incr_df.withColumn(column, regexp_replace(column, "E", ""))
inc_new.show()
This is not giving correct results; it does nothing.
Note: I have 100+ columns, so I need to use a loop.
Can someone help me spot my error?
A list comprehension will be neater and easier. Let's try:
inc_new = incr_df.select(*[regexp_replace(x, 'E', '').alias(x) for x in incr_df.columns])
inc_new.show()
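For comparison, the original loop did nothing because each iteration starts again from incr_df, so only the last column's replacement survives; carrying the result forward fixes it:
# Reassign the same dataframe on every iteration instead of restarting from incr_df.
inc_new = incr_df
for c in incr_df.columns:
    inc_new = inc_new.withColumn(c, regexp_replace(c, "E", ""))
inc_new.show()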
I'm trying to use the pandas.read_sql function.
I created a ClickHouse table and filled it:
create table regions
(
date DateTime Default now(),
region String
)
engine = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY tuple()
SETTINGS index_granularity = 8192;
insert into regions (region) values ('Asia'), ('Europe')
Then the Python code:
import pandas as pd
from sqlalchemy import create_engine
uri = 'clickhouse://default:@localhost/default'
engine = create_engine(uri)
query = 'select * from regions'
pd.read_sql(query, engine)
As a result I expected to get a dataframe with the columns date and region, but all I get is an empty dataframe:
Empty DataFrame
Columns: [2021-01-08 09:24:33, Asia]
Index: []
UPD: It turned out that specifying clickhouse+native solves the problem.
Can it be solved without +native?
There is an ancient issue: https://github.com/xzkostyan/clickhouse-sqlalchemy/issues/10. There is also a hint there suggesting to add FORMAT TabSeparatedWithNamesAndTypes at the end of the query, so the initial query will look like this:
select *
from regions
FORMAT TabSeparatedWithNamesAndTypes
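Putting the hint together with the code from the question (a sketch; the connection URI follows the question's, with an assumed empty password and localhost host):
import pandas as pd
from sqlalchemy import create_engine

# Keep the HTTP dialect, but ask ClickHouse to return column names and types.
engine = create_engine('clickhouse://default:@localhost/default')
query = 'select * from regions FORMAT TabSeparatedWithNamesAndTypes'
df = pd.read_sql(query, engine)
print(df)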
I am trying to rewrite a Spark SQL query as a dataframe transformation using groupBy and aggregate. Below is the original Spark SQL query.
result = spark.sql(
"select date, Full_Subcategory, Budget_Type, SUM(measure_value) AS planned_sales_inputs FROM lookups GROUP BY date, Budget_Type, Full_Subcategory")
Below is the dataframe transformation that I am trying to do.
df_lookups.groupBy('Full_Subcategory','Budget_Type','date').agg(col('measure_value'),sum('measure_value')).show()
But I keep getting the error below.
Py4JJavaError: An error occurred while calling o2475.agg.
: org.apache.spark.sql.AnalysisException: cannot resolve '`measure_value`' given input columns: [Full_Subcategory, Budget_Type, date];;
'Aggregate [Full_Subcategory#278, Budget_Type#279, date#413], [Full_Subcategory#278, Budget_Type#279, date#413, 'measure_value, sum('measure_value) AS sum(measure_value)#16168]
I am pretty sure this has something to do with grouping by columns and those columns being present in the select clause.
Kindly help.
I think it's because you are passing col('measure_value') inside the agg function, which does not make sense to me, because you are not aggregating that value in any way.
Just remove col('measure_value') from agg and you will get the right result.
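A minimal sketch of what that looks like, with an alias added to match the output column in the original SQL (the alias is just an illustration):
from pyspark.sql import functions as F

# Keep only the aggregate expression inside agg().
(df_lookups
    .groupBy('Full_Subcategory', 'Budget_Type', 'date')
    .agg(F.sum('measure_value').alias('planned_sales_inputs'))
    .show())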