How do you filter a database table in R using a data frame, list, vector, etc.? - sql

I have a large set of IDs in a CSV file. How could I filter a database table using only that one-column table from the CSV file?
For example, in the ODBC database we have:
TABLE 1
+---------+------+
| ID      | TYPE |
+---------+------+
| 43PRJIF | A    |
| 35IRPFJ | A    |
| 452JSU  | B    |
| 78JFIER | B    |
| 48IRUW  | C    |
| 89UEJDU | C    |
| 784NFJR | D    |
| 326NFR  | D    |
| 733ZREW | E    |
+---------+------+
And in the CSV file we have:
+---------+
| ID      |
+---------+
| 89UEJDU |
| 784NFJR |
| 326NFR  |
| 733ZREW |
+---------+
Basically I would like to use something from the dbplyr package if possible, e.g. importing the CSV table into a data frame and then using dbplyr syntax like:
new_table <- TABLE1 %>%
  filter(id == "ROWS IN THE CSV")
to get an output like this:
+---------+------+
| ID      | TYPE |
+---------+------+
| 89UEJDU | C    |
| 784NFJR | D    |
| 326NFR  | D    |
| 733ZREW | E    |
+---------+------+
Thank you for your help in advance!

In general, joining or merging tables requires them to share the same environment. Hence, there are three general options here:
1. Load the remote table into R's local workspace.
2. Load the CSV table into the database and use a semi-join.
3. 'Smuggle' the list of IDs from the CSV into the database.
Let's consider each in turn:
Option 1
This is probably the simplest option, but it requires that the remote/ODBC table is small enough to fit in R's working memory. If so, you can call local_table = collect(remote_table) to load the database table into R.
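For reference, collect() pulls the whole table across the connection before any filtering happens, so it is roughly equivalent to running:
SELECT * FROM TABLE1
after which you filter locally against the data frame read from the CSV.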
Option 2
dbplyr includes a command copy_to that lets you copy local tables via ODBC to a database/remote connection (the code below uses the lower-level DBI equivalent, dbWriteTable). You will need permission to create tables in the remote environment.
This approach makes use of the DBI package. At the time of writing, v1.0.0 of DBI on CRAN has some limitations when writing to non-default schemas, so you may need to upgrade to the development version on GitHub.
Your code will look something like:
DBI::dbWriteTable(db_connection,
                  DBI::Id(schema = "schema", table = "name"),
                  r_table_name)  # the local data frame to copy
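Once the CSV IDs are in the database - say in a one-column table csv_ids (a name used here purely for illustration) - the filter becomes a semi-join, for which dbplyr generates SQL along the lines of:
SELECT *
FROM TABLE1
WHERE EXISTS (
  SELECT 1
  FROM csv_ids
  WHERE TABLE1.ID = csv_ids.ID
)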
Option 3
Smuggle the list of IDs into the database via the table definition. This works best if the list of IDs is short.
Remote tables are essentially defined by the code/query that fetches their results. Hence the list of IDs can appear in the code that defines your remote table. Consider the following example:
library(dplyr)
library(dbplyr)
data(mtcars)
list_of_ids = c(1,2,3,4)
df = tbl_lazy(mtcars, con = simulate_mssql())
df %>% filter(ID %in% list_of_ids) %>% show_query()
show_query() renders the code that defines the current version of the remote table. In the example above it returns the following - note that the list of IDs now appears in the code.
<SQL>
SELECT *
FROM `df`
WHERE (`ID` IN (1.0, 2.0, 3.0, 4.0))
If the list of IDs is very long, the size of this query will become a problem, so there is a practical limit on the number of IDs you can filter on using this approach (I have not tested where the limit lies - I seldom use the IN clause with a list of more than 10).
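Applied to the question's data, the rendered query would look something like:
SELECT *
FROM `TABLE1`
WHERE (`ID` IN ('89UEJDU', '784NFJR', '326NFR', '733ZREW'))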

A quick solution is to use the merge function.
Example:
table1 <- data.frame(
  ID = c("4322", "2245", "3356"),
  TYPE = c("B", "A", "A")
)
table2 <- data.frame(
  ID = c("2245")
)
table3 <- merge(table2, table1, all.x = TRUE, by = "ID")
    ID TYPE
1 2245    A
table3 was created by filtering table1 using table2. Note that all.x = TRUE keeps every ID from table2 even when it has no match in table1 (TYPE would be NA); drop it if you want a plain inner join.
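In SQL terms this merge corresponds to a left join, something like (same table names as in the R example):
SELECT t2.ID, t1.TYPE
FROM table2 t2
LEFT JOIN table1 t1 ON t1.ID = t2.ID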

Related

Filtering Columns in PLSQL

I have a table with tons and tons of columns and I'm trying to select only certain columns based on the data the columns contain. The table is part of an application I'm building in Oracle APEX and looks something like this:
| Row Header | Criteria 1 | Criteria 2 | Criteria 3 | Criteria 4 | Criteria 5 |
| Category   | Type A     | Type B     | Type B     | Type A     | Type A     |
| ID         | 2.3        | 2.4        | 2.5        | 3.1        | 3.2        |
| Part A     | Yes        | Yes        | Yes        | No         | Yes        |
| Part B     | Yes        | No         | Yes        | Yes        | Yes        |
| Part C     | No         | Yes        | Yes        | Yes        | No         |
It goes on like this for around 1000-ish criteria and 100-ish parts. I need to find a way to select all the columns that are of a specific type into their own table using SQL.
I'd like the return to look like this:
| Row Header | Criteria 4 | Criteria 5 |
| Category   | Type A     | Type A     |
| ID         | 3.1        | 3.2        |
| Part A     | No         | Yes        |
| Part B     | Yes        | Yes        |
| Part C     | Yes        | No         |
This way I only have the columns showing that are part of the "Type A" Category and have an ID greater than 3.
I've looked into the GROUP BY and FILTER functions that SQL has to offer, as well as PIVOT, and I don't believe these will help me, but I'd be happy to be proven wrong.
In a relational database, columns are meant to be discrete, non-repeating attributes of a thing. Rows are meant to be multiple instances of that thing. Your table is reversed, using columns for what should be rows, and rows for what should be columns. Another factor is that Oracle limits you to 1000 columns, and you start undergoing severe performance degradation when you exceed 254 columns. Tables simply weren't meant to have hundreds, let alone thousands, of columns. So the first step is to pivot your table like this:
Criteria_No, Cat, ID, PtA, PtB, PtC
---------------------------------------------
Row 1: Criteria 1, Type A, 2.3, Yes, Yes, No
Row 2: Criteria 2, Type B, 2.4, Yes, No, Yes
Row 3: Criteria 3, Type B, 2.5, Yes, Yes, Yes
. . . thousands more
But even then, you mentioned that you have 100s of "parts", so Parts A, B, C aren't the only three - the series continues. If so, it would be a violation of normal form to have such a repeating list in a single row. So you have one more step to fix your design: Break this into three tables.
CRITERIA
Criteria_No, Cat, ID
---------------------------------------------
Row 1: Criteria 1, Type A, 2.3
Row 2: Criteria 2, Type B, 2.4
Row 3: Criteria 3, Type B, 2.5
PARTS
Part, anything-else-about-part
-----------------
Part A, blah
Part B, blah
Part C, blah
. . .
And now the bridge table between them:
CRITERIA_PARTS
Criteria_No, Part
-----------------
1, Part A
1, Part B
1, Part C
2, Part A
2, Part B
. . . and so on
You should also place a foreign key on each of the bridge table columns to point to their respective parent tables to ensure data integrity.
Now you query by joining the tables together in your SQL.
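For example, to pull the 'Type A' criteria with an ID greater than 3 together with their parts (a sketch using the column names above), the query could look like:
SELECT c.criteria_no, c.cat, c.id, cp.part
FROM criteria c
JOIN criteria_parts cp ON cp.criteria_no = c.criteria_no
WHERE c.cat = 'Type A'
AND c.id > 3;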
Updated: you asked how to move data into this new criteria table from your existing one. Use dynamic SQL like this:
BEGIN
  FOR i IN 1..1000
  LOOP
    EXECUTE IMMEDIATE 'INSERT INTO criteria (criteria_no,cat,id) SELECT criteria_'||i||',category,id FROM oldtable';
  END LOOP;
  COMMIT;
END;
But of course set the 1000 to the real number of criteria_n columns.

How to get column names from a query?

I have a specific query with joins and aliases, and I need to retrieve the column names for a REST request in Talend.
I'm using Talend Open Studio for Data Integration 6.2 and I've got an Oracle 11g database with a read-only account. I can execute scripts with Talend. For example, the query:
select
  u.name as "user",
  f.name as "food",
  e.rate
from
  Users u
  join Eval e on u.user_id = e.user_id
  join Food f on e.food_id = f.food_id
where
  1 = 1
should give the following result:
+------+--------+------+
| user | food | rate |
+------+--------+------+
| Baba | Donuts | 16.0 |
| Baba | Cheese | 20.0 |
| Keke | Pasta | 12.5 |
| Keke | Cheese | 15.0 |
+------+--------+------+
And I try to get the columns (in the right order) as follows, using scripts or Talend:
+--------+
| Column |
+--------+
| user   |
| food   |
| rate   |
+--------+
Is there a way to query the Oracle database to get the columns, or to retrieve them using Talend?
UPDATE
Thanks to Marmite Bomber, a duplicate has been identified for the Oracle approach. Now we need a Talend approach to the problem.
You can try this in a tJavaRow, following your DBInput component:
for (java.lang.reflect.Field field : row1.getClass().getDeclaredFields()) {
    context.columnName = field.getName();
    System.out.println("Field name is " + context.columnName);
}
Spotted on the Talend help center here: https://community.talend.com/t5/Design-and-Development/resolved-how-to-get-the-column-names-in-a-data-flow/td-p/99172
You can extend this and put the column list on your output flow:
// add this inside the loop, and 'columnNames' as an output row in the tJavaRow schema
output_row.columnNames += context.columnName + ";";
With a tNormalize after the tJavaRow, you should get the expected result.
HereĀ“s a link to an oracle community thread which should answer your question
community.oracle.com
I am not able to write a comment, so posting this as an answer:
SELECT column_name
FROM all_tab_cols
WHERE table_name = 'TABLE_NAME_HERE' -- unquoted identifiers are stored in uppercase in Oracle
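Note that all_tab_cols only covers tables and views, not the aliases of an ad-hoc query. One workaround (a sketch, untested against this setup) is to run the query with a predicate that returns no rows and let the client - e.g. Talend's schema guessing - read the column names from the empty result set:
SELECT *
FROM (
  -- the original query, with its aliases, goes here
)
WHERE ROWNUM = 0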

Getting difference from two tables and deleting it in SQL Server

I am facing an issue in writing the logic of a query that deletes rows which do not exist in both of the two tables.
For example, I have tables "Stage" and "Parent". I am using composite primary keys to uniquely identify records (multiple primary key columns).
Stage structure and data:
S_Column1 (Primary) | PRIDATA1 | PRIDATA4
S_Column2 (Primary) | PRIDATA2 | PRIDATA5
S_Column3 (Primary) | PRIDATA3 | PRIDATA6
S_Column4           | DJUC     | JDNC
S_Column5           | DSSDC    | JDDOS
Parent structure and data:
P_Column1 (Primary) | PRIDATA1 | PRIDATA4 | PRIDATA7
P_Column2 (Primary) | PRIDATA2 | PRIDATA5 | PRIDATA8
P_Column3 (Primary) | PRIDATA3 | PRIDATA6 | PRIDATA9
P_Column4           | DJUC     | JDNC     | FFED
P_Column5           | DSSDC    | JDDOS    | NHUY
The above is just a sample of the structure and data of the two tables.
So basically what I want to do is write a query to delete the row that has PRIDATA7, PRIDATA8 and PRIDATA9 as its primary key, because its entries are not present in the STAGE table.
I am not skilled, but I know I need to find the matching data using a JOIN and delete the rest of the data from the PARENT table, whose entries aren't present in the STAGE table.
PS: I will be using this in a Trigger.
Try NOT EXISTS:
delete from parent
where not exists (
  select 1
  from stage s
  where s.S_Column1 = parent.P_Column1
    and s.S_Column2 = parent.P_Column2
    and s.S_Column3 = parent.P_Column3)
You might be looking for the EXCEPT operator.
Read here: https://msdn.microsoft.com/pl-pl/library/ms188055(v=sql.110).aspx
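A minimal sketch of the EXCEPT variant in T-SQL, assuming the column names above:
delete p
from Parent p
join (
  select P_Column1, P_Column2, P_Column3 from Parent
  except
  select S_Column1, S_Column2, S_Column3 from Stage
) d on p.P_Column1 = d.P_Column1
   and p.P_Column2 = d.P_Column2
   and p.P_Column3 = d.P_Column3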

SQL query to format table data for DataSource in GridView

I am looking for a SQL Server query that could transfer source SQL table data:
TextID | Text | LanguageID
-------|-------|-------------------------------------
app.aa | Hi | 6a13ea09-46ea-4c93-9b6a-e26bdc6ff4d8
app.cc | Hund | 0c894bb7-4937-4903-906a-d1b1dd64935c
app.aa | Hallo | 0c894bb7-4937-4903-906a-d1b1dd64935c
app.cc | Dog | 6a13ea09-46ea-4c93-9b6a-e26bdc6ff4d8
app.bb | Star | 6a13ea09-46ea-4c93-9b6a-e26bdc6ff4d8
...
into a table like this one:
TextID | Original | Translated
-------|----------|-----------
app.aa | Hi | Hallo
app.bb | Star | -
app.cc | Dog | Hund
...
so that I can use it as a DataSource for a GridView in ASP.NET. Thank you in advance for your help.
Whenever you need to combine data from two different rows into one, you need to join. For example:
select src.TextID "TextID", src.Text "Original", tr.Text "Translated"
from source_table src
left join source_table tr
  on src.TextID = tr.TextID
  and tr.LanguageID = 'yyy' -- yyy is the target language id
where src.LanguageID = 'xxx' -- xxx is the source language id
The left join ensures that untranslated words are included with a null Translated value. To make a table for your DataSource, use SELECT ... INTO (SQL Server does not support CREATE TABLE ... AS SELECT), or wrap the query in a view:
select ... into translations from ...
-- or: create view translations as select ...
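A fuller sketch using the language IDs from the question, assuming the source table is named texts (a guessed name):
select src.TextID,
       src.Text as Original,
       tr.Text as Translated
into translations
from texts src
left join texts tr
  on tr.TextID = src.TextID
  and tr.LanguageID = '0c894bb7-4937-4903-906a-d1b1dd64935c' -- target language in the sample data
where src.LanguageID = '6a13ea09-46ea-4c93-9b6a-e26bdc6ff4d8' -- source language in the sample data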

SQL Query converting to Rails Active Record Query Interface

I have been using SQL queries in my Rails code which need to be transitioned to Active Record queries. I haven't used Active Record before, so I tried going through http://guides.rubyonrails.org/active_record_querying.html to get the proper syntax to be able to switch to this method of getting the data. I am able to convert the simple queries into this format, but there are other, more complex queries like:
SELECT b.owner,
Sum(a.idle_total),
Sum(a.idle_monthly_usage)
FROM market_place_idle_hosts_summaries a,
(SELECT DISTINCT owner,
hostclass,
week_number
FROM market_place_idle_hosts_details
WHERE week_number = '#{week_num}'
AND Year(updated_at) = '#{year_num}') b
WHERE a.hostclass = b.hostclass
AND a.week_number = b.week_number
AND Year(updated_at) = '#{year_num}'
GROUP BY b.owner
ORDER BY Sum(a.idle_monthly_usage) DESC
which I need in Active Record format, but because of the complexity I am stuck as to how to proceed with the conversion.
The output of the query is something like this:
+----------+-------------------+---------------------------+
| owner    | sum(a.idle_total) | sum(a.idle_monthly_usage) |
+----------+-------------------+---------------------------+
| abc      | 485               | 90387.13690185547         |
| xyz      | 815               | 66242.01857376099         |
| qwe      | 122               | 11730.609939575195        |
| asd      | 80                | 9543.170425415039         |
| zxc      | 87                | 8027.090087890625         |
| dfg      | 67                | 7303.070011138916         |
| wqer     | 76                | 5234.969814300537         |
+----------+-------------------+---------------------------+
Instead of converting it to Active Record, you can use the find_by_sql method, since your query is a bit complex.
You can also use ActiveRecord::Base.connection directly to fetch the records, like this:
ActiveRecord::Base.connection.execute("your query")
You can create the subquery separately with ActiveRecord and convert it to SQL using to_sql.
Then use joins to join your table a with the b one, which is the subquery. Note also the use of the Active Record clauses select, where, group and order, which are basically what you need to build this complex SQL query in ActiveRecord.
Something similar to the following will work:
subquery = SubModel.select("DISTINCT ... ").where(" ... ").to_sql
Model.select("b.owner, ... ")
.joins("JOIN (#{subquery}) b ON a.hostclass = b.hostclass")
.where(" ... ")
.group("b.owner")
.order("Sum(a.idle_monthly_usage) DESC")