Column order in Apache Zeppelin is wrong when selecting data from a temporary table; how to put a specific column first? - dataframe

Currently we have the Scala DataFrame output with the id value shown first (even though it is chronologically added to the DataFrame last). The other columns appear dynamically, based on the .pivot() function and the data.
When I call for the data in the %sql interpreter, the order changes, so the CSV file I download also has the id column last, which doesn't work for me. I can't simply write a selection script that manually puts the id column first, since I can't control the other columns because of the pivot. Is there any other way to make a specific column go first?
The Scala paragraph is:
resultMean.registerTempTable("mean")
The sql paragraph is:
%sql
select *
from mean

For anyone reading this in the future, the reason for this behavior was a misuse of the DataFrame. In Scala, .show() was applied to one DataFrame, while the export to the temp table was applied to another one. If you face the same issue, please double-check that you apply your methods to the same object.
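If you do need to force a particular column to the front after a pivot, you can build the projection dynamically from the DataFrame's own column list. A minimal sketch, assuming the DataFrame is called resultMean and the column to move is id:

import org.apache.spark.sql.functions.col

// "id" first, then every other column in whatever order the pivot produced them
val reordered = resultMean.select(
  (col("id") +: resultMean.columns.filterNot(_ == "id").map(col)): _*
)

// createOrReplaceTempView supersedes the deprecated registerTempTable in Spark 2.x;
// on Spark 1.x, keep registerTempTable as in the original paragraph
reordered.createOrReplaceTempView("mean")

After this, select * from mean in the %sql paragraph returns id first, followed by the pivot-generated columns.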

Related

Pentaho step - Use SQL functions to add a column in data before dumping it into DB

I am fairly new to Pentaho, and while working with it I have stumbled across a problem. Below is my flow:
Read input from a file. Let's say this has 5 columns.
Make some modifications to existing columns (filter, modify and so on).
Add a new column, which will be equal to an SQL function of the current row's data. For example, it could be sum(id, id+1).
Dump to the database.
Steps 1, 2 and 4 are already in place and working fine. It's step 3 where I am stuck. I've tried Execute SQL, but that is only for modifying DDL and doesn't return data. Table input needs the data to already be in a table, which isn't the case for me.
I have a workaround: I can insert all rows into the DB and then fire an update query, but I was hoping there is a better way to do this.
You can add a Formula step, and in the formula column specify what you want to achieve. For example, take another column + 1 and save it in a new field, or replace the existing field's value.
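A rough sketch of how the Formula step's grid might be filled in, assuming the input column is named id (the field names here are illustrative; PDI's Formula step uses OpenFormula-style syntax with square brackets around field names):

New field : id_plus_one
Formula   : [id] + 1
Value type: Integer

The new field is then available to the downstream Table output step like any other column.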

Create table schema and load data in bigquery table using source google drive

I am creating a table using Google Drive as the source and Google Sheets as the format.
I have selected "Drive" as the value for "Create table from". For "File format", I selected Google Sheets.
I also selected schema auto-detect and the input parameters.
It creates the table, but the first row of the sheet is loaded as data instead of as the table's fields.
Kindly tell me what I need to do to get the first row of the sheet treated as the table's column names, not as data.
It would have been helpful if you could have included a screenshot of at least the top few rows of the file you're trying to upload, to show the data types you have in there. BigQuery, at least as of when this response was composed, cannot differentiate between column names and data rows if both have similar data types while schema auto-detection is used. For instance, if your data looks like this:
headerA, headerB
row1a, row1b
row2a, row2b
row3a, row3b
BigQuery would not be able to detect the column names (at least automatically, using the UI options alone), since the headers and the row data are all strings. The "Header rows to skip" option would not help with this.
Schema auto-detection should be able to detect and differentiate column names from data rows when different columns have different data types, though.
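For example, with data like the following, auto-detection can tell the header row apart from the data rows, because the second column's values are numeric while its header cell is a string:
headerA, headerB
row1a, 11
row2a, 22
row3a, 33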
You have an option to skip the header row under Advanced options. Simply put 1 as the number of header rows to skip (your first row is where your header is). It will skip the first row as data and use it for the values of your header.
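If you prefer the command line to the UI, a rough sketch of the same setup with the bq tool follows. Treat the exact flag and option names as assumptions to verify against the current documentation, and FILE_ID, mydataset and mytable as placeholders:

# Generate an external table definition from the sheet, auto-detecting the schema
bq mkdef --autodetect --source_format=GOOGLE_SHEETS \
  "https://docs.google.com/spreadsheets/d/FILE_ID" > table_def.json

# Edit table_def.json and set "skipLeadingRows": 1 inside googleSheetsOptions,
# so the first row is treated as the header rather than as data.

# Create the table backed by the sheet
bq mk --external_table_definition=table_def.json mydataset.mytable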

How to query the presence of an element inside a Spark Dataframe Column that contains a set?

I have a Spark DataFrame where one column has the type Set<text>.
This column contains a set of strings, for example ["eenie","meenie","mo"].
How do I filter the contents of the whole dataframe so that
I only get those rows that (for example) contain the value eenie in the set?
I'm looking for something similar to
dataframe.where($"list".contains("eenie"))
The example shown above is only valid when the content of the column list is a string, not a Set. What alternatives are there that fit my circumstances?
Edit: My question is not a duplicate. The user in that question has a set of values and wants to know which of them appear in a specific column. I have a column that contains a set, and I want to know whether a specific value is part of that set. My approach is the opposite of theirs.
Try:
import org.apache.spark.sql.functions.array_contains
dataframe.where(array_contains($"list", "eenie"))
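A self-contained sketch of the filter in action, with an illustrative DataFrame (the column names and data here are made up; Spark SQL represents a set-like column as an array):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.array_contains

val spark = SparkSession.builder().appName("array-contains-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, Seq("eenie", "meenie", "mo")),
  (2, Seq("miney", "mo"))
).toDF("id", "list")

// Keeps only row 1, since only its set contains "eenie"
df.where(array_contains($"list", "eenie")).show(false)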

Retrieve results from a batch of SQL queries in Pentaho or Postgres?

I'm still relatively new to SQL and Pentaho.
I've pulled a table with two different IDs and need to run a query for each specific instance.
For example,
SELECT *
FROM Table
WHERE RecordA = 'value in column A'
AND RecordB = 'value in column B'
I need the results back, either appended as new columns to the original table or as their own text file output.
I was initially looking at using a formula for this inside of Pentaho, but couldn't quite figure it out. Since I had the query written, I threw it into Excel and got the concatenated results (so a string of 350 or so queries that I need to run). I'm just not sure how to accomplish this - I tried the Execute SQL script step inside of Pentaho, but it doesn't seem to produce output.
Any direction would be useful. I've searched a little but have come up short so far, possibly because I am still pretty new to this platform.
You can accomplish this in a lot of ways, with a "Database Lookup" step for example, but I usually do it in a quite simple way, and here is an example for your tests. I hope it helps.
The idea here is to have two Table input steps. The first one fetches the IDs we want to look up, for example with a SQL query similar to the one shown in the screenshot. The result will be a one-column stream of rows.
Next we have a second Table input that reads the rows received and executes its query once for each row. I'll add a screenshot with the options that I selected.
What it does is replace each placeholder '?' with the data that is received. If you need two columns, use two '?', but remember that it replaces the first '?' with the first column and the second '?' with the second column.
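For this question's case, the second Table input's query could look like the following; with "Insert data from step" pointing at the first Table input, each ? is replaced in order with the incoming row's fields (table and column names are taken from the question):

SELECT *
FROM Table
WHERE RecordA = ?
  AND RecordB = ?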
And you are good to go. Test it a couple of times and good luck.
And here is the config for the second Table input.

Use Columns.Add(...) in Word with non-uniform column widths?

The problem I'm having is that table.Columns.Add(ref Object BeforeColumn) requires a reference to another column in the table. However, when I try to access the last column in the table to pass as a reference, using table.Columns.Add(table.Columns[table.Columns.Count])
I get the error:
"Cannot access individual columns in this collection because the table has mixed cell widths."
As my current workaround, I catch the error and call table.Columns.DistributeWidth() to make the columns uniform, then run the rest of the code. However, I lose the formatting of my cell widths this way, which is unfortunate.
Is there any way I can work around this without losing the cell widths?
(I realize one way is to store every cell's width before running this process and then re-apply the widths afterward, but this seems like a very costly solution to something that should be simpler.)
I've found one way to do it. Here's how I approached it.
*Caution: I'm assuming that the table is uniform, i.e. the number of columns is the same across all the rows. (Note: the API has a Table.Uniform property, but its description is incomplete. The API says "True if all the rows in a table have the same number of columns." However, it also checks whether the columns have uniform width.)
Instead of using table.Columns.Add(table.Columns[table.Columns.Count]) to add a column before the last one, I select a cell in the table and use the insert command:
// assuming `table` is the name of the table you want to add columns to
table.Cell(1, table.Columns.Count).Select();  // select a cell in the last column
Word.Selection selection = table.Application.ActiveWindow.Selection;
selection.InsertColumns();  // inserts a new column to the left of the selection
This might actually be a better way to add columns, as the API gives you many more options on how to insert (e.g. use InsertColumnsRight to insert to the right of the column). The Columns.Add(...) function by default inserts to the left of the selection.