I have a bunch of files on S3 that contain just MD5s, one per line. I created an AWS Athena table to run a de-duplication query against the MD5s. In total there are hundreds of millions of MD5s in those files and in the table.
Athena Table Creation Query:
CREATE EXTERNAL TABLE IF NOT EXISTS database.md5s (
`md5` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://bucket/folder/';
Here are all the "dedup" queries I've tried (these should all produce the same result):
SELECT DISTINCT md5
FROM md5s;
SELECT md5
FROM md5s
GROUP BY md5;
SELECT md5
FROM md5s
GROUP BY DISTINCT md5;
SELECT DISTINCT md5
FROM md5s
GROUP BY DISTINCT md5;
All of the result CSVs that Athena outputs still contain repeated MD5s. What gives?
Is Athena Doing Partial Deduplication? - Even more peculiar, if I perform a COUNT(DISTINCT md5) in Athena, the count I get is different from the number of rows in the results export.
COUNT(DISTINCT md5) in Athena: 97,533,226
records in export of distinct MD5s: 97,581,616
There are 14,790 duplicates in the results export, so the COUNT(DISTINCT) count and the results export can't both be right.
Is Athena CREATING Duplicates on Export? - The plot thickens. If I query my Athena Table for one of the MD5s that is duplicated in the Athena result export, I only get one result/row from the table. I tested this with a LIKE query to make sure whitespace wasn't causing the issue. This means Athena is ADDING duplicates to the export. There are never more than two of the same MD5 in the results.
select
md5,
to_utf8(md5)
from md5s
where md5 like '%0061c3d72c2957f454eef9d4b05775d7%';
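For a broader check than a single LIKE lookup, a grouped count over the whole table should show whether any MD5 genuinely appears more than once in the source data (a sketch, using the md5s table defined above):
-- List md5s that occur more than once in the table itself;
-- if this comes back empty, the duplicates are being introduced on export.
SELECT md5, count(*) AS occurrences
FROM md5s
GROUP BY md5
HAVING count(*) > 1
LIMIT 100;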
Are Athena's Counts & Results File Both Wrong? - I deduped these same records using MySQL and ended up with 97,531,010 unique MD5s. Athena's counts and result details are below.
COUNT(DISTINCT md5) in Athena: 97,533,226
records in export of distinct MD5s: 97,581,616
There are 14,790 duplicates in the results export, and neither Athena number matches the MySQL count, so it seems that both the COUNT(DISTINCT) count and the results export are wrong.
I think this is an Athena bug - I've filed a ticket with AWS's dev team to get this fixed, and will update this post when it is.
Here is the related AWS Forum Post where other users are seeing the same issues.
https://forums.aws.amazon.com/thread.jspa?messageID=764702
I have confirmed with the AWS team that this was a known bug in AWS Athena at the time the question was asked. I'm not sure whether it has since been resolved.
When in doubt, please use CTAS to remove any duplicates:
CREATE TABLE new_table
WITH (
format = 'Parquet',
parquet_compression = 'SNAPPY')
AS SELECT DISTINCT *
FROM old_table;
Reference: https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
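To sanity-check the result, the total row count and the distinct count on the new table should come back equal (a quick check, assuming new_table has the single md5 column from the question):
-- Both counts should match if new_table is fully de-duplicated.
SELECT count(*) AS total_rows, count(DISTINCT md5) AS distinct_md5s
FROM new_table;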
Related
I have two tables in Athena, each of which has md5 as one of its columns, and both tables have around a billion rows. I need to find the md5 values that are common to both tables. I was trying the following.
WITH t1 as (
SELECT table1.md5
from table1
),
t2 as (
SELECT table2.md5
from table2
)
SELECT md5
FROM (SELECT md5 FROM t1 UNION ALL SELECT md5 FROM t2) u
GROUP BY md5
HAVING count(*) > 1
The query works when I limit the records, but when I run it with no limit the Athena query fails after 20-30 minutes; it seems the output is too big for it to handle. There are no partitions I can use to load less data. I know the query output file it generates in S3 will be in the GBs. Is there a way I can make this query better so that Athena likes it, or do you have any other thoughts on how to achieve this?
I also tried to create a new table from the query output, but that runs into the same problem: the query still has to complete before the new table is created.
CREATE TABLE common_md5 AS (above query)
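For what it's worth, one way to write that CTAS (a sketch, assuming the tables really are named table1 and table2 as above) is to have it write the intersection straight to S3 as Parquet rather than producing a giant CSV result:
-- INTERSECT returns only the distinct md5 values present in both tables,
-- and CTAS writes the result to S3 as compressed Parquet.
CREATE TABLE common_md5
WITH (
  format = 'Parquet',
  parquet_compression = 'SNAPPY'
) AS
SELECT md5 FROM table1
INTERSECT
SELECT md5 FROM table2;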
I tried to create an Athena table with 1700+ partitions where each partition has 100 buckets (not S3 buckets but Athena buckets, which are like hashes on top of high cardinality columns to speed up queries). Unfortunately I learned that Athena doesn't support more than 100 partitions, and if you use their workaround to support more than 100 partitions, you can't use any buckets. So then I thought, what if I made each partition its own separate table? Then I could give each table its own 100 buckets. The idea is that I would hide all of these tables under a SQL view. When the user specifies a partition, the view would pick the correct table, and query only that table. That led me to post this question.
How can I make a SQL view in Athena that conditionally queries tables? At first I thought it would be simple, but now I'm realizing it's not. I could naively union all the tables, and then run the condition on the unioned result, but that would defeat the purpose of the partitions because I would end up reading all of the data all the time. Is what I'm describing possible?
I want to do this but without the union:
CREATE VIEW example AS (
SELECT col1, col2, 1 as partition FROM partition1
UNION ALL
SELECT col1, col2, 2 as partition FROM partition2)
Where the user would specify something like:
SELECT col1, col2
FROM example
WHERE partition = 1
I'm trying to run a simple query with a wildcard table using standard SQL on BigQuery. Here's the code:
#standardSQL
SELECT dataset_id, SUM(totals.visits) AS sessions
FROM `dataset_*`
WHERE _TABLE_SUFFIX BETWEEN '20150518' AND '20210406'
GROUP BY 1
My sharded dataset contains one table per day since 18/05/2015, so, for example, the table for 18/05/2015 is 'dataset_20150518'.
The error is: 'Wildcard table over non partitioning tables and field based partitioning tables is not yet supported, first normal table dataset_test, first column table dataset_20150518.'
I've tried different kinds of SELECTs and aggregations, but the error won't go away. I just want to query all the tables in that timeframe.
This is because all the tables matched by the wildcard have to have the same schema. In your case the wildcard also matches dataset_test, which does not have the same schema as the others (is dataset_test a partitioned table?).
You should be able to get around this limitation by deleting dataset_test and any other tables with a different schema, or by narrowing the wildcard prefix so that dataset_test no longer matches:
#standardSQL
SELECT dataset_id, SUM(totals.visits) AS sessions
FROM `dataset_20*`
WHERE _TABLE_SUFFIX BETWEEN '150518' AND '210406'
GROUP BY 1
Official documentation
I have the following partitioned table in Athena (HIVE/Presto):
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
id STRING,
data STRING
)
PARTITIONED BY (
year string,
month string,
day string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3://mybucket';
Data is stored in s3 organized in a path structure like s3://mybucket/year=2020/month=01/day=30/.
I would like to know if the following query would leverage partitioning optimization:
SELECT
*
FROM
mydb.mytable
WHERE
(year='2020' AND month='08' AND day IN ('10', '11', '12')) OR
(year='2020' AND month='07' AND day IN ('29', '30', '31'));
I am assuming that since the IN operator will be transformed into a series of OR conditions, this will still be a query that benefits from partitioning. Am I correct?
Unfortunately Athena does not expose information that would make it easier to understand how to optimise queries. Currently the only thing you can do is to run different variations of queries and look at the statistics returned in the GetQueryExecution API call.
One way to figure out if Athena will make use of partitioning in a query is to run the query with different values for the partition column and make sure that the amount of data scanned is different. If the amount of data is different Athena was able to prune partitions during query planning.
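For example, with the mytable schema above, you could run two variants and compare the "Data scanned" statistic Athena reports for each (a sketch; the counts themselves don't matter, only the bytes scanned):
-- If partition pruning works, this should scan only the three August partitions...
SELECT count(*) FROM mydb.mytable
WHERE year = '2020' AND month = '08' AND day IN ('10', '11', '12');

-- ...and report noticeably less data scanned than this broader filter.
SELECT count(*) FROM mydb.mytable
WHERE year = '2020';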
Yes, it's also mentioned in the documentation.
When Athena runs a query on a partitioned table, it checks to see if any partitioned columns are used in the WHERE clause of the query. If partitioned columns are used, Athena requests the AWS Glue Data Catalog to return the partition specification matching the specified partition columns. The partition specification includes the LOCATION property that tells Athena which Amazon S3 prefix to use when reading data. In this case, only data stored in this prefix is scanned. If you do not use partitioned columns in the WHERE clause, Athena scans all the files that belong to the table's partitions.
I have a data set with 2 fields storing strings.
1. In SAS, when I do a NODUPKEY on the dataset, I get ~200 records.
2. In SQL, when I do a SELECT DISTINCT / GROUP BY / PARTITION BY, I get ~2000 records. This SQL code runs on Hive, hosted on an AWS EMR cluster.
The data set I am working on has NULL in some of the records for one of the fields. I am not doing anything else apart from what I mentioned in points 1 and 2.
I am looking for an explanation as to why there is such a huge mismatch between these two when I am doing just a simple duplicate removal.
DISTINCT operates on all fields in the SELECT statement, and the database will likely treat nulls and blanks as different values.
SAS does not treat nulls and blanks as different, and NODUPKEY only removes duplicates based on the variables listed in the BY statement.
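A sketch of how to make the Hive side behave more like SAS's NODUPKEY (the names my_table, key_field, and other_field are placeholders): dedupe on the key column only, keeping one arbitrary row per key, and normalize blank strings to NULL so blanks and nulls collapse together.
-- Keep a single row per key, similar to PROC SORT NODUPKEY with BY key_field;
-- the CASE expression turns blank strings into NULL so they group with real NULLs.
SELECT key_field, other_field
FROM (
  SELECT
    CASE WHEN TRIM(key_field) = '' THEN NULL ELSE key_field END AS key_field,
    other_field,
    ROW_NUMBER() OVER (
      PARTITION BY CASE WHEN TRIM(key_field) = '' THEN NULL ELSE key_field END
      ORDER BY other_field
    ) AS rn
  FROM my_table
) t
WHERE rn = 1;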