AWS Glue: Is it possible to pull only specific data from a database? - sql

I need to transform a fairly big database table to CSV with AWS Glue. However, I only need the newest table rows from the past 24 hours. There is a column which specifies the creation date of each row. Is it possible to transform just these rows, without copying the whole table into the CSV file? I am using a Python script with Spark.
Thank you very much in advance!

There are some Built-in Transforms in AWS Glue which are used to process your data. These transforms can be called from ETL scripts.
Please refer to the link below:
https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html
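For illustration, here is a rough sketch using the built-in Filter transform for the 24-hour case; database, table, column and bucket names are placeholders, and note that the filter is applied only after the data has been read:

import datetime

from awsglue.context import GlueContext
from awsglue.transforms import Filter
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the source table from the Data Catalog ("my_db" / "my_table" are placeholders).
dyf = glue_context.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table")

# Keep only rows created in the last 24 hours (assumes "created_at" is read as a timestamp).
cutoff = datetime.datetime.utcnow() - datetime.timedelta(hours=24)
recent = Filter.apply(frame=dyf, f=lambda row: row["created_at"] >= cutoff)

# Note: Filter runs after the table has been read; it does not push the predicate
# down to the source database.
glue_context.write_dynamic_frame.from_options(
    frame=recent,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/exports/"},
    format="csv",
)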

You haven't mentioned the type of database that you are trying to connect to. In any case, for JDBC connections Spark has the query option, with which you can issue a regular SQL query to fetch only the rows you need.
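For example, a minimal PySpark sketch of that approach, assuming a MySQL source and a created_at timestamp column (URL, credentials, table name and output path are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("recent-rows-to-csv").getOrCreate()

# The "query" option (Spark 2.4+) sends this SQL to the database, so only the
# matching rows are transferred. The WHERE clause below uses MySQL syntax;
# adjust it to your database's date functions.
recent = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://my-host:3306/my_db")
    .option("user", "my_user")
    .option("password", "my_password")
    .option("query",
            "SELECT * FROM my_table "
            "WHERE created_at >= NOW() - INTERVAL 1 DAY")
    .load()
)

recent.write.mode("overwrite").csv("s3://my-bucket/exports/recent/")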

Related

Enable sync between BigQuery and Snowflake

We are using BigQuery and Snowflake (Azure hosted), and we often export data from BigQuery and import it into Snowflake and vice versa. Is there any easy way to integrate the two systems, e.g. automatically sync a BigQuery table to Snowflake, rather than exporting to a file and importing?
You should have a look at Change Data Capture (CDC) solutions to automate the sync.
Some of them have native BigQuery and Snowflake connectors.
Some examples:
HVR
Qlik Replicate
Striim
...
There are many ways to implement this, and the best one will depend on the nature of your data.
For example, if every day you have new data in BigQuery, then all you need to do is set up a daily export of the new data from BigQuery to GCS. Then it's easy to set up Snowflake to read new data from GCS whenever it shows up, using Snowpipe:
https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-gcs.html
But then how often do you want to sync this data? Is it append only, or does it need to account for past data changing? How do you solve conflicts when the same row changes in different ways on both sides? Etc.
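To make the export step concrete, a rough Python sketch using the google-cloud-bigquery client might look like this (project, dataset, table and bucket names are invented; scheduling and incremental filtering are still up to you):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
)

# Export the table to GCS; Snowpipe (see the link above) can then auto-ingest
# the files from the bucket into Snowflake.
extract_job = client.extract_table(
    "my_project.my_dataset.my_table",
    "gs://my-sync-bucket/exports/my_table-*.json",
    job_config=job_config,
)
extract_job.result()  # wait for the export job to finish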
I have the same scenario. I've built a template for this in a Jupyter Notebook. I've done a gap analysis after a few days and, at least in our case, it seems that Firebase/Google Analytics adds more rows to the already compiled daily tables even a few days later. We have about 10% more rows in an older BigQuery daily table than what was captured in Snowflake, so be mindful of the gap. To date, the template I've created is not able to handle the missing rows. For us it works because we look at aggregated values (daily active users, retention, etc.) and the gap there is minimal.
You could use Sling, which I worked on. It is a tool that allows you to copy data between databases (including a BigQuery source and a Snowflake destination) using bulk loading methodologies. There is a free CLI version and a Cloud (hosted) version. I actually wrote a blog entry about this in detail (albeit with an AWS destination, the logic is similar), but essentially, if you use the CLI version, you can run one command after setting up your credentials:
$ sling run --src-conn BIGQUERY --src-stream segment.team_activity --tgt-conn SNOWFLAKE --tgt-object public.activity_teams --mode full-refresh
11:37AM INF connecting to source database (bigquery)
11:37AM INF connecting to target database (snowflake)
11:37AM INF reading from source database
11:37AM INF writing to target database [mode: full-refresh]
11:37AM INF streaming data
11:37AM INF dropped table public.activity_teams
11:38AM INF created table public.activity_teams
11:38AM INF inserted 77668 rows
11:38AM INF execution succeeded

Checking replicated data in Pentaho

I have about 100 tables to which we replicate data, e.g. from an Oracle database.
I would like to quickly check that the data replicated to the tables in DB2 is the same as in the source system.
Does anyone have a way to do this? I could create 100 transformations, but that's monotonous and time consuming. I would prefer to process this in a loop.
I thought I would keep the queries in a table and reach into it for records.
I read the data with a Table input step (sql_db2, sql_source, table_name) and write it to Copy rows to result. Next I read a single record and put it into a loop.
But then a problem arises, because I don't know how to dynamically compare the data for the tables: each table has different columns, and that's where I'm stuck.
Is this even possible?
You can inject metadata (in this case your metadata would be the column and table names) into a lot of steps in Pentaho. You create one transformation that collects the metadata and injects it into another transformation that contains only the steps and some basic information; the bulk of the information about the columns affected by the different steps lives in the transformation that injects the metadata.
Check the official Pentaho documentation about Metadata Injection (MDI) and the basic metadata injection sample shipped with your PDI installation.
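For what it's worth, outside of PDI the per-table loop you describe could also be sketched in plain Python; this only compares row counts, and the connections and table list are assumed (e.g. cx_Oracle for the source, ibm_db_dbi for DB2, with the list of tables coming from your query table):

def compare_row_counts(source_conn, target_conn, tables):
    # source_conn / target_conn are plain DB-API connections; "tables" is a
    # list of table names to check.
    mismatches = []
    for table in tables:
        src_cur = source_conn.cursor()
        tgt_cur = target_conn.cursor()
        src_cur.execute("SELECT COUNT(*) FROM " + table)
        tgt_cur.execute("SELECT COUNT(*) FROM " + table)
        src_count = src_cur.fetchone()[0]
        tgt_count = tgt_cur.fetchone()[0]
        if src_count != tgt_count:
            mismatches.append((table, src_count, tgt_count))
    return mismatches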

Visualization Using Tableau

I am new to Tableau and having performance issues, and I need some help. I have a Hive query result in Azure Blob Storage named part-00000.
The issue is that I want to execute a custom query in Tableau and generate graphical reports from it, and this is where performance suffers.
Can I do this? How?
I have about 7.0 million rows in the Hive table.
You can find the Custom SQL option in the data source connection; check the linked image.
You might want to consider creating an extract instead of a live connection. Additional considerations would include hiding unused fields and using filters at the data source level to limit data as per requirement.

How to merge small files saved to Hive by Spark SQL?

As in the code below, I insert a dataframe into a Hive table. The output HDFS files of the Hive table contain too many small files. How can I merge them when saving to Hive?
myDf.write.format("orc").partitionBy("datestr").insertInto("myHiveTable")
When there are 100 tasks, it will produce 100 small files.
Is using coalesce on the dataframe a good idea?
myDf.coalesce(3).write.format("orc").partitionBy("datestr").insertInto("myHiveTable")
Why do the Hive configurations below not work?
sqlContext.sql("set hive.merge.mapfiles=true")
sqlContext.sql("set hive.merge.sparkfiles=false")
sqlContext.sql("set hive.merge.smallfiles.avgsize=16000000")
sqlContext.sql("set hive.merge.size.per.task=256000000")
Thanks a lot for any help.
I encountered this problem and found issue-24940.
Using /*+ COALESCE(numPartitions) */ or /*+ REPARTITION(numPartitions) */ in a Spark SQL query will control the number of output files.
In my practice I recommend the second hint, because it generates a new stage to do this job, while the first one doesn't, which may kill the job because of too few tasks in the last stage.
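For example, on Spark 2.4 or later the REPARTITION hint could be applied to the dataframe from the question roughly like this (a sketch; "my_tmp_view" is an invented view name, and myDf/sqlContext are reused from the question):

# Register the dataframe so the hint can be used in a SQL query.
myDf.createOrReplaceTempView("my_tmp_view")

# The hint repartitions the result to 3 partitions, so the insert below writes
# about 3 files per output partition instead of one file per task.
merged = sqlContext.sql("SELECT /*+ REPARTITION(3) */ * FROM my_tmp_view")
merged.write.format("orc").partitionBy("datestr").insertInto("myHiveTable")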
That's because Spark SQL writes as many files as there are Spark partitions, even if the dynamic partitioning config is on.
I faced the same problem. In my view, the configurations mentioned above are only applicable to Hive on the MapReduce engine: in my case, HiveQL commands work well (small files are merged).
See Hive architecture for more detail.

PDI or mysqldump to extract data without blocking the database nor getting inconsistent data?

I have an ETL process that will run periodically. I was using Kettle (PDI) to extract the data from the source database and copy it to a stage database. For this I use several transformations with Table input and Table output steps. However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data. Furthermore, I don't know whether the source database would be blocked. This would be a problem if the extraction takes some minutes (and it will). The advantage of PDI is that I can select only the necessary columns and use timestamps to get only the new data.
On the other hand, I think mysqldump with --single-transaction allows me to get the data in a consistent way without blocking the source database (all tables are InnoDB). The disadvantage is that I would get unnecessary data.
Can I use PDI, or do I need mysqldump?
PS: I need to read specific tables from specific databases, so I think xtrabackup is not a good option.
However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data
I think "Table Input" step doesn't take into account any modifications that are happening when you are reading. Try a simple experiment:
Take a .ktr file with a single table input and table output. Try loading the data into the target table. While in the middle of data load, insert few records in the source database. You will find that those records are not read into the target table. (note i tried with postgresql db and the number of rows read is : 1000000)
Now for your question, i suggest you using PDI since it gives you more control on the data in terms of versioning, sequences, SCDs and all the DWBI related activities. PDI makes it easier to load to the stage env. rather than simply dumping the entire tables.
Hope it helps :)
Interesting point. If you do all the table inputs in one transformation then at least they all start at the same time, but while the result is likely to be consistent, it's not guaranteed.
There is no reason you can't use PDI to orchestrate the process AND use mysqldump. In fact, for bulk inserts or extracts it's nearly always better to use the vendor-provided tools.
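For instance, a small Python wrapper around mysqldump (which could equally be launched from a PDI job entry) might look like this; host, credentials, database and table names are placeholders:

import subprocess

TABLES = ["customers", "orders"]  # only the tables you actually need

with open("/tmp/stage_dump.sql", "w") as out:
    subprocess.run(
        [
            "mysqldump",
            "--single-transaction",  # consistent InnoDB snapshot, no table locks
            "--host", "source-host",
            "--user", "etl_user",
            "--password=secret",
            "source_db",
            *TABLES,
        ],
        stdout=out,
        check=True,
    )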