Pandas and SQLAlchemy: renaming columns during join - pandas

I have table A and table B. Both have a column id and a column name.
When I use pd.read_sql() to convert the result of a SQLAlchemy query to a pandas DataFrame, the resulting DataFrame has two columns named id and two columns named name.
The join is executed on the id column, so even though there are two id columns there is no ambiguity: both contain the same values, and I can simply drop one of them.
The two columns named name are a problem because they are not identical: column name of table A holds the name of an entity A, while column name of table B holds the name of an entity B. From the DataFrame alone I cannot tell for sure which of the two columns comes from table A and which from table B. Is there any way to solve this, for instance by adding a prefix to the column names? More generally, is there any way to keep using the convenient pd.read_sql() in this situation?
my_dataframe = pd.read_sql(
    session.query(TableA, TableB)
    .join(TableB)
    .statement,
    session.bind)
Note: in this question I am trying to simplify the structure of a more complex preexisting Postgres database. Therefore, it won't be possible to alter the structure of the database.

The solution was actually really simple, but you have to rename each single field:
my_dataframe = pd.read_sql(
    session.query(TableA.field1.label('my_new_name1'),
                  TableA.field2.label('my_new_name2'),
                  TableB.field1.label('my_other_name2'))
    .join(TableB)
    .statement,
    session.bind)
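If typing a label for every field gets tedious, the labels can be generated from the table metadata. What follows is only a minimal sketch and not part of the original answer: it assumes SQLAlchemy 1.4 or later, uses an in-memory SQLite database with two made-up models in place of the real Postgres tables, and the helper prefixed() is a hypothetical name introduced here for illustration.

```python
import pandas as pd
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine, select
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class TableA(Base):  # stand-in model, not the real schema
    __tablename__ = "table_a"
    id = Column(Integer, primary_key=True)
    name = Column(String)


class TableB(Base):  # stand-in model, not the real schema
    __tablename__ = "table_b"
    id = Column(Integer, ForeignKey("table_a.id"), primary_key=True)
    name = Column(String)


def prefixed(model, prefix):
    """Hypothetical helper: label every column as '<prefix><column name>'."""
    return [col.label(prefix + col.name) for col in model.__table__.columns]


# Assumption: in-memory SQLite instead of the Postgres database in the question.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
with engine.begin() as conn:
    conn.execute(TableA.__table__.insert(), [{"id": 1, "name": "entity A"}])
    conn.execute(TableB.__table__.insert(), [{"id": 1, "name": "entity B"}])

a, b = TableA.__table__, TableB.__table__
stmt = select(*prefixed(TableA, "a_"), *prefixed(TableB, "b_")).select_from(
    a.join(b, a.c.id == b.c.id)  # explicit ON clause for the join
)

my_dataframe = pd.read_sql(stmt, engine)
print(list(my_dataframe.columns))  # a_id, a_name, b_id, b_name
```

Because the labels are part of the SELECT itself, nothing in the database has to change, and the prefix tells you which table each DataFrame column came from.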

Related

Is there a SQL query to prevent two identical names in a column? And a query for the columns not to be left blank?

I have created multiple tables inside a database using phpMyAdmin, but I can't find out how to do the following:
The column called "name" must not be allowed to contain two identical names.
The columns called "prod_time" and "stock_ant" must be filled in (leaving them blank or at zero should not be an option).
Are there queries I can use for these requirements?
If you want a column to have unique values, use a unique constraint or index. For instance:
alter table t add constraint unq_t_name unique (name);
If you don't want columns to have NULL values, then declare them NOT NULL when you create the table.
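To see both constraints reject bad rows, here is a small sketch. It is an illustration under assumptions, not the asker's setup: it uses Python's built-in sqlite3 module in place of MySQL, declares the constraints at table creation, and try_insert() is a hypothetical helper. The table name t and the column names come from the question and answer above.

```python
import sqlite3

# Assumption: SQLite stands in for the MySQL database behind phpMyAdmin.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE t (
        name      TEXT    UNIQUE,    -- no two rows may share a name
        prod_time INTEGER NOT NULL,  -- must be filled in
        stock_ant INTEGER NOT NULL   -- must be filled in
    )
    """
)
conn.execute("INSERT INTO t VALUES ('widget', 5, 10)")


def try_insert(row):
    """Hypothetical helper: return 'ok' or the database's error message."""
    try:
        conn.execute("INSERT INTO t VALUES (?, ?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return str(exc)


duplicate = try_insert(("widget", 7, 3))   # same name as the existing row
missing = try_insert(("gadget", None, 3))  # prod_time left empty
accepted = try_insert(("gadget", 2, 3))    # satisfies both constraints
print(duplicate, missing, accepted, sep="\n")
```

Note that NOT NULL only rejects missing values; a zero is still accepted, so ruling out zero would need an additional rule.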

Power Pivot relationships

I am trying to create relationships (joins) between tables in Power Pivot.
I have two tables I would like to join, connected by a common column, CustomerID.
One is a fact table, the other a dimension (lookup) table.
I have run "remove duplicates" on both tables without any problem.
But I still get an error saying: "the relationship cannot be created because each column contains duplicate values. Select at least one column that contains only unique values".
The fact table contains duplicates (as it should?) and the dimension table does not, so why do I get this error?
Help much appreciated.
I created an appended table from both "CustomerID" columns. After the columns were appended, I could run "remove duplicates" on the result and connect the two tables through the newly created appended table.
I don't know whether this will cause another problem later, however.
You can also check for duplicate id values in a column by using the group by feature.
Remove all columns except ID, add a column that consists only of the number 1.
Group by ID, summing the content of the added column and filter out IDs whose total equals 1. What's left are duplicated IDs.
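The same group-and-count check can be sketched outside Power Pivot. This is only an illustration of the logic in plain Python, and the CustomerID values are made up:

```python
from collections import Counter

# Made-up CustomerID column from the lookup table.
customer_ids = ["C001", "C002", "C003", "C002", "C004", "C001"]

# Equivalent of "add a column of 1s, group by ID, sum the column".
counts = Counter(customer_ids)

# Drop the IDs whose total equals 1; what remains occurs more than once.
duplicated = {cid: n for cid, n in counts.items() if n > 1}
print(duplicated)
```

Any ID that shows up in the result would stop the column from serving as the unique side of the relationship.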

Renaming two columns or swapping the values? Which one is better?

I have a table with more than 1.5 million records and two columns, A and B. By mistake, the values meant for column A were inserted into column B, and column B's values were inserted into A.
We only found the issue recently. What is the best way to correct it: swapping the column names (I don't know how that would be possible, since renaming A to B fails while B already exists), or swapping the values contained in the two columns?
Hi, you can use the query below to swap the column values:
UPDATE table_name SET A = B, B = A;
With this much data, renaming the columns may look attractive, but renaming columns to fix a data problem is not the right solution, so use the update query above to correct the data.
Before updating, take a backup of the table you are about to change:
CREATE TABLE table_name_bkp AS SELECT * FROM table_name;
Always keep a backup when modifying original data, so that a mistake cannot ruin it.
15 lakh (1.5 million) rows aren't a big deal for SQL Server. Swapping column names has many drawbacks in a relational database: indexes, foreign keys and everything else that references the columns may need to be changed as well. So I would suggest the traditional path: simply do the update.
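The backup-then-swap sequence can be tried on a toy table first. The sketch below is an assumption-laden illustration: it runs against an in-memory SQLite database through Python's sqlite3 module, with made-up rows and an added id column used only for ordering. Two caveats apply to other engines: MySQL evaluates the assignments of a single-table UPDATE from left to right, so this statement would not swap the values there, and SQL Server creates a copy with SELECT ... INTO rather than CREATE TABLE ... AS SELECT.

```python
import sqlite3

# Assumption: SQLite stands in for the real database; rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (id INTEGER PRIMARY KEY, A TEXT, B TEXT)")
conn.executemany(
    "INSERT INTO table_name (A, B) VALUES (?, ?)",
    [("b1", "a1"), ("b2", "a2")],  # values landed in the wrong columns
)

# Keep a copy of the original data before touching it.
conn.execute("CREATE TABLE table_name_bkp AS SELECT * FROM table_name")

# Swap inside a transaction; both right-hand sides read the old row values.
with conn:
    conn.execute("UPDATE table_name SET A = B, B = A")

rows = conn.execute("SELECT A, B FROM table_name ORDER BY id").fetchall()
backup = conn.execute("SELECT A, B FROM table_name_bkp ORDER BY id").fetchall()
print(rows)
print(backup)
```

Comparing rows with backup confirms that every pair was exchanged and that the untouched copy is still available if something goes wrong.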

Hive to Hive ETL

I have two large Hive tables, say TableA and TableB (which get loaded from different sources).
These two tables have almost identical table structure / columns with same partition column, a date stored as string.
I need to filter records from each table based on certain (identical) filter criteria.
These tables have some columns containing "codes", which need to be looked up to get their corresponding "values".
There are eight to ten such lookup tables, say LookupA, LookupB, LookupC, etc.
Now, I need to:
do a union of those filtered records from TableA and TableB.
do a lookup into the lookup tables and replace those "codes" from the filtered records with their respective "values". If a "code" or "value" is unavailable in the filtered records or lookup table respectively, I need to substitute it with zero or an empty string
transform the dates in the filtered records from one format to another
I am a beginner in Hive. Please let me know how I can do this. Thanks.
Note: I can manage up to the union of the tables. I need some guidance on the lookup and the transformation.
To do the lookup, follow the steps below.
Create a custom user-defined function (UDF) that does the lookup work: write a Java program that performs the lookup, package it as a jar and add it to Hive, something like this:
ADD JAR /home/ubuntu/lookup.jar;
Then add the lookup file containing the key-value pairs:
ADD FILE /home/ubuntu/lookupA;
Then create a temporary lookup function:
CREATE TEMPORARY FUNCTION getLookupValueA AS 'com.LookupA';
Finally, call this lookup function in the SELECT query, where it returns the lookup value for the given lookup key.
The same thing can be achieved using a JOIN, but that will take a hit on performance.
With the join approach, join the source and lookup tables on the lookup code, something like:
select a.key, b.lookupvalue
from tableA a join lookuptable b
on a.key = b.lookupKey
For the date transformation, you can use the date functions in Hive.
For the above problem, follow these steps:
Union the two tables (their schemas must be the same).
For this scenario you can try a Pig script.
The script would look like this (join TableA and TableB with the lookup tables and generate the appropriate columns):
a = join TableA by codesA left outer, lookupA by codesA;
b = join a by codesB left outer, lookupB by codesB;
Similarly for TableB.
Suppose some value of codesA does not have a value in the lookup table, then:
z = foreach b generate codesA as codesA, (valueA is null ? '0' : valueA) as valueA;
(this replaces every null valueA with '0').
For date formatting, if you are using Pig 0.12 or later, you can use ToString(CurrentTime(),'yyyy-MM-dd')
I hope this solves your problem. Let me know if you have any concerns.
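The logic both answers aim at (an outer join against the lookup table, a default when nothing matches, then a date reformat) can be sketched end to end. This is only an illustration under assumptions, not Hive or Pig code: it uses Python's sqlite3 module, a made-up filtered table with a single code column, and assumed date formats (yyyyMMdd in, yyyy-MM-dd out), since the question does not state them. LEFT OUTER JOIN and COALESCE are also available in HiveQL.

```python
import sqlite3
from datetime import datetime

# Assumption: SQLite stands in for Hive; tables and rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE filtered (rec_id INTEGER, codeA TEXT, load_date TEXT);
    CREATE TABLE LookupA (code TEXT, value TEXT);
    INSERT INTO filtered VALUES
        (1, 'X1', '20240131'),   -- code present in the lookup table
        (2, 'ZZ', '20240201'),   -- code missing from the lookup table
        (3, NULL, '20240202');   -- no code on the record at all
    INSERT INTO LookupA VALUES ('X1', 'Alpha');
    """
)

# The outer join keeps every record; COALESCE supplies the default value.
rows = conn.execute(
    """
    SELECT f.rec_id, COALESCE(l.value, '') AS valueA, f.load_date
    FROM filtered f
    LEFT OUTER JOIN LookupA l ON f.codeA = l.code
    ORDER BY f.rec_id
    """
).fetchall()


def reformat_date(text, src="%Y%m%d", dst="%Y-%m-%d"):
    """Hypothetical helper: convert a date string between assumed formats."""
    return datetime.strptime(text, src).strftime(dst)


result = [(rec_id, value, reformat_date(day)) for rec_id, value, day in rows]
print(result)
```

With eight to ten lookup tables, the same pattern repeats: one outer join and one COALESCE per code column.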

T-SQL - Using Column Name in where condition without any references when having multiple tables that are engaged using multiple joins

I am a newbie to T-SQL. I came across a stored procedure where multiple tables are joined, but the where clause refers to a column without any table reference and compares it to an incoming variable, like
where UserId = @UserId
instead of using a table reference, like
a.UserId = @UserId
Can anyone please point me to material that clears this up?
If the query works, it means that only one of the joined tables has a column named UserId. If several of them have a column with the same name, you have to reference the table too.
If you don't specify the table reference in that case, you will get the error
Ambiguous column name 'UserId'.
which means that at least two of the tables have a column named UserId.
In any case, always try to qualify columns with the table reference.
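The rule is easy to reproduce on a toy schema. The sketch below is an illustration under assumptions: it uses Python's sqlite3 module rather than SQL Server, so the wording of the error differs slightly from the T-SQL message, and the Users and Orders tables are made up.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; tables are made up.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Users (UserId INTEGER, UserName TEXT);
    CREATE TABLE Orders (OrderId INTEGER, UserId INTEGER);
    INSERT INTO Users VALUES (1, 'ann');
    INSERT INTO Orders VALUES (10, 1);
    """
)

join = "SELECT o.OrderId FROM Users u JOIN Orders o ON o.UserId = u.UserId "

# UserId exists in both tables, so the unqualified name is rejected.
try:
    conn.execute(join + "WHERE UserId = ?", (1,))
    outcome = "ok"
except sqlite3.OperationalError as exc:
    outcome = str(exc)

# Qualifying the column removes the ambiguity.
qualified = conn.execute(join + "WHERE u.UserId = ?", (1,)).fetchall()

# UserName exists in only one table, so it works without a qualifier.
unique_col = conn.execute(join + "WHERE UserName = ?", ("ann",)).fetchall()
print(outcome, qualified, unique_col)
```

The third query is the situation in the stored procedure: it works only as long as no other joined table gains a column with the same name, which is why qualifying every column is the safer habit.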