Connection Timeout Error while reading the table having more than 100 columns in Mosaic Decisions - mosaic-decisions

I am reading a table via the Snowflake reader node. When the table has a smaller number of columns/attributes (around 50-80), it is read successfully on the Mosaic Decisions canvas. But when the number of attributes increases (approx. 385 columns), the Mosaic reader node fails. As a workaround I tried adding a WHERE clause with 1=2; in that case it pulls the structure of the table. But when I try to read the records, even with a limit applied to the query (only 10 records), it throws a connection timeout error.

I faced a similar issue while reading a table (approx. 300 columns) and managed it with the help of the input parameters available in Mosaic. In your case you will have to change the copy-field variable used in the query to 1=1 at run time.
The following steps can be referred to in order to achieve this:
Create a parameter (e.g. copy_variable) that holds the default value 2 for the copy-field variable.
In the reader node, write the SQL with 1 = $(copy_variable) (a sketch follows below). While validating, this is the same as the 1=2 condition, so it should validate fine.
Once the query is validated and the schema is generated, update the default value of $(copy_variable) to 1 so that, while running, you still get all records.
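A rough sketch of what the reader-node SQL could look like (my_wide_table is a placeholder name; $(copy_variable) is the parameter created in the steps above):
SELECT *
FROM my_wide_table          -- placeholder for your 385-column table
WHERE 1 = $(copy_variable)  -- validates as 1=2 (schema only), runs as 1=1 (all rows)
;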

Related

Superset with Impala - Invalid Session id/No protocol version header

I have Superset using Impala as the main data source. Most of the time every query runs smoothly and I can build charts and dashboards with ease. I need to generate a Table Chart containing around 100k records and 30+ columns, but I am having some issues. It is basically a SELECT *; no aggregations, filtering or ordering are being used.
When the data is relatively big, Superset just throws a bunch of errors (they appear to be coming from Impala), but I cannot find any information regarding those errors. I have tried paginating the results (see the sketch below), but it did not work. Also, when I run the query on the Superset Chart page, it doesn't take long; it just displays the error. The only way any information gets displayed in the Table Chart is when I set the "Row limit" option to 10 records. But this will not work out for me.
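For illustration, manual pagination in Impala SQL would look roughly like the sketch below (my_table and id are placeholder names; Impala only allows OFFSET together with ORDER BY):
SELECT *
FROM my_table
ORDER BY id
LIMIT 10000 OFFSET 0;   -- next page: OFFSET 10000, then 20000, and so on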
These are the errors that keep occurring:
impala error: Invalid session id: f344bf1aa2a42e2b:ad1df0047d7f909c
impala error: No protocol version header
When I use the Oracle connection that I also have, I can generate a table chart from a large amount of records with no problem.
My setup is the following:
Impala v3.2.0-cdh6.3.3
Superset v0.36.0
So, is this a problem with Superset or with Impala? Could it have something to do with the configuration in Superset?

dbGetQuery retrieves fewer rows than expected

I am trying to fetch a large dataset into the R environment using an ODBC connection.
When I try to retrieve data from a large dataset using the dbGetQuery() function, the number of rows is less than what I see in Hive. Sometimes the same code fetches the correct number of rows. Could someone tell me whether I should clear any buffer before fetching the data?
library(DBI)                       # provides dbConnect() and dbGetQuery()
hive_con <- dbConnect(odbc::odbc(), .connection_string = Connection_String)
qry  <- "select * from mytable"
rslt <- dbGetQuery(hive_con, qry)  # fetches the full result set into a data frame
I have tried changing the n parameter of dbGetQuery(), but the problem still persists.
Finally, I found that I was using the HTTP transport protocol for the data extraction. This protocol does not show any warning if data loss happens.
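A simple way to confirm such silent truncation is to compare nrow(rslt) in R against a count run directly on the server; a minimal sketch, using the same table name as above:
SELECT COUNT(*) AS row_cnt
FROM mytable;   -- compare this value with nrow(rslt) after dbGetQuery()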

Change data type in table due to Disk Space/Memory Error

Attempts at changing data type in Access have failed due to error:
"There isn't enough disk space or memory". Over 385,325 records exists in the table.
Attempts at the following links, among other StackOverFlow threads, have failed:
Can't change data type on MS Access 2007
Microsoft Access can't change the datatype. There isn't enough disk space or memory
The intention is to change the data type of one column from "Text" to "Number". The aforementioned links cannot accommodate that, either because of the table size or because of the desired target data type.
Breaking out the table may not be an option due to the number of records.
Help on this would be appreciated.
I cannot say for sure about MS Access, but in MS SQL one can avoid a table rebuild (which requires lots of time and space) by appending a new column that allows NULL values at the rightmost end of the table, updating that column with normal UPDATE queries, and AFAIK even dropping the old column and renaming the new one. In the end it is just the position of that column that has changed.
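A minimal T-SQL sketch of that approach, assuming a hypothetical table MyTable with a text column OldCol that should become an integer:
-- 1) append a nullable column of the target type
ALTER TABLE MyTable ADD NewCol INT NULL;
-- 2) populate it with a normal update query (TRY_CONVERT returns NULL for values that do not convert)
UPDATE MyTable SET NewCol = TRY_CONVERT(INT, OldCol);
-- 3) drop the old column and rename the new one to take its place
ALTER TABLE MyTable DROP COLUMN OldCol;
EXEC sp_rename 'MyTable.NewCol', 'OldCol', 'COLUMN';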
As for your 385,325 records (I'd expect that number to be correct): even if the table had 1000 columns with 500 Unicode characters each, we would end up with approximately 385,325 * 1000 * 500 * 2 bytes ~ 385 GB of data. Nowadays that should not exceed what is available, so:
if it's disk space you're running out of, how about moving the data to another computer, changing the DB there and moving it back?
if the DB seems to be corrupted (and the standard tools didn't help; make a copy first), it will most probably help to create a new table or database using table-creation queries (better: create it manually and append the data).

Target-based commit point while updating into a table

One of my mappings is running for a really long time (2 hours). From the session log I can see the statement "Timeout based commit point", which is taking most of the time, and the Busy percentage for the SQL transformation is very high (it is what is taking the time; I ran the SQL query manually in the DB and it works fine). Basically there is a Router that splits the records between insert and update, and the update stream is the one taking long. It has an SQL transformation, an Update Strategy and an Aggregator. I added a Sorter before the Aggregator, but no luck.
I also changed the commit interval, the Line Sequential Buffer Length and the Maximum Memory Allowed after checking some other blogs. Could you please help me with this?
If possible, try to avoid the transformations that create a cache, because if the input record count grows in the future, the cache size will also grow and the throughput will drop.
1) Aggregator: try to do the aggregation in the SQL override itself (see the sketch after this list)
2) Sorter: try to do the sorting in the SQL override itself
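A sketch of such a SQL override, with sales_stage, customer_id and amount as placeholder names; doing the ORDER BY in the same query also covers the Sorter:
SELECT customer_id,
       SUM(amount) AS total_amount
FROM sales_stage
GROUP BY customer_id
ORDER BY customer_id;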
Generally the SQL transformation is slow for huge data loads, because for each input record an SQL session is invoked, a connection to the database is established and the row is fetched. Say, for example, there are 1 million records: 1 million SQL sessions are initiated in the backend and the database is called each time.
What is the SQL transformation doing? Is it just generating a surrogate key, or is it fetching a value from a table based on a value derived from the stream?
For fetching a value from a table based on a value derived from the stream:
Try to use a Lookup transformation
For generating a surrogate key, use an Oracle sequence instead (see the sketch below)
Let me know if its purpose is any thing other than that
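For the surrogate-key case, a minimal Oracle sketch (surrogate_key_seq is a placeholder name):
CREATE SEQUENCE surrogate_key_seq START WITH 1 INCREMENT BY 1;
-- each call returns the next key value
SELECT surrogate_key_seq.NEXTVAL FROM dual;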
Also do the checks below:
Sort the session log on thread and make a note of the start and end times of the following:
1) lookup cache creation (time between Query issued --> First row returned --> Cache creation completed)
2) Reader thread first row return time
Regards,
Raj

looping in a Kettle transformation

I want to repetitively execute an SQL query looking like this:
SELECT '${date.i}' AS d,
COUNT(DISTINCT xid) AS n
FROM table
WHERE date
BETWEEN DATE_SUB('${date.i}', INTERVAL 6 DAY)
AND '${date.i}'
;
It is basically a grouping by time spans, except that they overlap, which prevents the use of GROUP BY.
That is why I want to execute the query repetitively for every day in a certain time span. But I am not sure how I should implement the loop. What solution would you suggest?
The Kettle variable date.i is initialized from a global variable. The transformation is just one of several in the same transformation bundle. The "stop transformation" step would maybe be implemented implicitly by just not re-entering the loop.
Here's the flow chart:
Flow of the transformation:
In step "INPUT" I create a result set with three identical fields keeping the dates from ${date.from} until ${date.until} (Kettle variables). (for details on this technique check out my article on it - Generating virtual tables for JOIN operations in MySQL).
In step "SELECT" I set the data source to be used ("INPUT") and that I want "SELECT" to be executed for every row in the served result set. Because Kettle maps parameters 1 on 1 by a faceless question-mark I have to serve three times the same paramter - for each usage.
The "text file output" finally outputs the result in a generical fashion. Just a filename has to be set.
Content of the resulting text output for 2013-01-01 until 2013-01-05:
d;n
2013/01/01 00:00:00.000;3038
2013/01/02 00:00:00.000;2405
2013/01/03 00:00:00.000;2055
2013/01/04 00:00:00.000;2796
2013/01/05 00:00:00.000;2687
I am not sure if this is the slickest solution but it does the trick.
In Kettle you want to avoid loops, as they can cause real trouble in transformations. Instead you should do this by adding a step that puts a row into the stream for each date you want (with the value stored in a field) and then using that field value in the query.
ETA: The stream is the thing that moves rows (records) between steps. It may help to think of it as consisting of a table at each hop that temporarily holds rows between steps.
You want to avoid loops because a Kettle transform is only sequential at the row level: rows may be processed in parallel and out of order, and the only guarantee is that each row will pass through the steps in order. Because of this, a loop in a transform does not behave as you would intuitively expect.
FYI, it also sounds like you might need to go through some of the Kettle tutorials if you are still unclear about what the stream is.