pandas read_sql_query improvement

So I downloaded some data from a database which conveniently has a sequential ID column. I saved the max ID for each table I am querying to a small text file which I read into memory (max_ids dataframe).
I was trying to create a query that says: give me all of the data where Idcol > max_id for that table. I kept getting errors saying that Series objects are mutable and therefore cannot be used as a parameter. The code below ended up working, but it was literally just a guess-and-check process: I turned the value into an int and then a string, which basically extracted the actual value from the dataframe.
Is this the correct way to accomplish what I am trying to do, before I replicate it for about 32 different tables? I always want to be able to grab only the latest data from these tables, which I then process in pandas and eventually consolidate and export to another database.
df= pd.read_sql_query('SELECT * FROM table WHERE Idcol > %s;', engine, params={'max_id', str(int(max_ids['table_max']))})
Can I also make the table name dynamic? I need to go through a list of tables. The database is MS SQL and I am using pymssql and SQLAlchemy.
Here is an example of the output of max_ids['table_max']:
Out[11]:
0 1900564174
Name: max_id, dtype: int64

Assuming that your max_ids DF looks as follows:
In [24]: max_ids
Out[24]:
   table  table_max
0  tab_a      33333
1  tab_b     555555
2  tab_c   66666666
you can do it this way:
qry = 'SELECT * FROM {} WHERE Idcol > :max_id'

for i, r in max_ids.iterrows():
    print('Executing: [%s], max_id: %s' % (qry.format(r['table']), r['table_max']))
    # cast to a plain Python int so the DB driver gets a scalar it can bind
    df = pd.read_sql_query(qry.format(r['table']), engine,
                           params={'max_id': int(r['table_max'])})

Related

SQL concat value as column

I'm trying to convert a string into a valid column reference.
As you can see, I need to build something like 'mo._olddb_uid_' + nu._olddb_name_db so that it ends up using the column mo._olddb_uid_001.
How can I achieve this:
SELECT *
FROM [PI_CONSOLIDATION].[dbo].[new_unite] nu
LEFT JOIN [PI_CONSOLIDATION].[dbo].[Motif_Orientation] mo ON CONCAT('mo._olddb_uid_',nu._olddb_name_db) = CONCAT(nu._olddb_name_db,'_',nu.id_motif_orientation)
WHERE nu.nom_res = 'TEST' and nu.prenom_res = 'Foobar'
Thanks
EDITED:
I have 18 applications, each with its own DB. The applications are almost identical but hold different data, and sometimes the same data can also be found in the other sources.
So [new_unite] has the patient id, the group id and the source database:
id_resident id_groupe_res _olddb_name_db
728 31 src1
629 21 src6
731 25 src9
934 12 src18
...
The other table has parameters that are identical across sources but have different IDs depending on the source DB. So [Motif_Orientation] looks like this:
id_motif_orientation  label
1407                  Famille
1410                  Structures d'hébergement
1422                  Etablissement d'Education Spéciale
Mainly, this query is just to test whether the data is stored correctly in the final application, where there is only one DB and all the data is merged.
What you could try is to make a function whose output is the value you want to left join. The output should be a single value; the type doesn't matter. Then call it like this:
LEFT JOIN
(func_name(param1,param2,...) AS CONCAT(...) FROM DUAL)
ON 1=1
The result should be your value left joined to the current table under whatever column name you need. The only problem is that you will need to do this one by one for each row, so it is only really useful for smaller tables. I wish you had given more info so I could give a better answer, but this worked for me when I had a similar issue with renaming a column.
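A rough T-SQL sketch of that idea: dbo.func_name and the two @param variables are hypothetical placeholders for whatever builds the single value you want to attach (SQL Server has no DUAL table, so a FROM-less derived table plays that role here):
-- dbo.func_name is a hypothetical scalar function returning the value to attach
DECLARE @param1 VARCHAR(50) = 'src1';
DECLARE @param2 INT = 1407;

SELECT nu.*, v.computed_value
FROM [PI_CONSOLIDATION].[dbo].[new_unite] nu
LEFT JOIN (SELECT dbo.func_name(@param1, @param2) AS computed_value) v
       ON 1 = 1
WHERE nu.nom_res = 'TEST' AND nu.prenom_res = 'Foobar';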

Find duplicates in large dataset Excel

I have a recurring task that I would need a better solution for.
I pull data from two different databases in two different systems (don't ask why, it's just the way it is). When I do this, I would prefer the two datasets to be the same size. I have a primary key on both; let's call this "ID". What I want to do is compare this ID between table1 and table2 and get the unique values (so I can go on and see why I have more rows in one table). My dataset gets very large (a bit over 100,000 rows), which makes my VLOOKUP function in Excel extremely slow. Is there any way of solving this in Excel with reasonable speed? Solutions using a VBA macro, pivot tables or Excel's built-in SQL would be fine. Using Excel 2016.
Sample table:
ID_TableA ID_TableB
123456789208435989 123456789208435989
123456789239344137 123456789368934745
123456789368934745 123456789381895013
123456789381895013 123456789447760867
123456789447760867 123456789466692531
123456789466692531 123456789470807304
123456789470807304 123456789504343451
123456789504343451 123456789571573964
123456789563853210 123456789666106771
123456789571573964 123456789683792216
123456789666106771 123456789719645070
123456789683792216 123456789747751420
123456789719645070 123456789770236822
123456789747751420 123456789839975896
123456789770236822 123456789920037815
123456789825288494 123456789930612286
123456789839975896 123456789936072949
123456789920037815 123456789948401617
123456789930612286 123456789982601470
123456789936072949
123456789948401617
123456789982601470
The result from the solution should output:
123456789825288494
123456789563853210
123456789239344137
The data in the tables are 18-character number series where the first 9 digits never change.
Edit: both tables can contain unique values; the result should return the values that are unique to either table.
Assuming you have these two columns in separate tables in a single database, this problem is easy to handle using SQL. Here is one way:
SELECT a.ID_TableA
FROM TableA a
LEFT JOIN TableB b
ON a.ID_TableA = b.ID_TableB
WHERE b.ID_TableB IS NULL
UNION
SELECT b.ID_TableB
FROM TableA a
RIGHT JOIN TableB b
ON a.ID_TableA = b.ID_TableB
WHERE a.ID_TableA IS NULL;
Another way, using EXISTS:
SELECT ID_TableA
FROM TableA a
WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE a.ID_TableA = b.ID_TableB)
UNION
SELECT ID_TableB
FROM TableB b
WHERE NOT EXISTS (SELECT 1 FROM TableA a WHERE a.ID_TableA = b.ID_TableB);
While I would do that with an Access query, as others suggested, here's my 2 cents for your question.
VLOOKUP IS slow and not the right function for this.
COUNTIF is a bit better, but ISNUMBER(MATCH()) seems to be the fastest combination by far; for example, =ISNUMBER(MATCH(A2, B:B, 0)) returns TRUE when the value in A2 also appears somewhere in column B.
Have a look at https://stackoverflow.com/a/29983885/78522
You can use Power Query (Get & Transform Data):
let
    SourceA = Excel.CurrentWorkbook(){[Name="tblA"]}[Content],
    SourceB = Excel.CurrentWorkbook(){[Name="tblB"]}[Content],
    UniqueA = Table.Join(SourceA, {"ID_TableA"}, SourceB, {"ID_TableB"}, JoinKind.LeftAnti),
    UniqueB = Table.Join(SourceA, {"ID_TableA"}, SourceB, {"ID_TableB"}, JoinKind.RightAnti),
    OutputList = List.Combine({UniqueA[ID_TableA], UniqueB[ID_TableB]})
in
    OutputList
(Edited having seen your requirement to return unique values from EITHER table)
Doing some testing, using some mocked up data in a similar format, this seems pretty fast:
Input from tblA Rows: 250,000
Input from tblB Rows: 250,000
Start: 25/10/2018 14:17:13
End: 25/10/2018 14:17:15
Returned 41,042 unique values in about 2 seconds

Why does selecting more result fields double the data scanned in BigQuery

I have a table with two integer fields, x and y, and a few million rows.
The fields are created with the following code:
Field.newBuilder("x", LegacySQLTypeName.INTEGER).setMode(Field.Mode.NULLABLE).build();
If I run the following from the web:
SELECT x,y FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: "Valid: This query will process 64.9 MB when run."
compared to:
SELECT x FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: "Valid: This query will process 32.4 MB when run."
It scans more than double the data compared to the single-column query.
I would expect it to first find the relevant rows based on the WHERE clause and then fetch the extra field without scanning that entire second column.
Any input on why it doubles the data scanned, and how to avoid it, would be appreciated.
In my application I have hundreds of possible fields which I need to fetch for a very small number of rows (50) that match the query.
Does this mean I will need to process the data of all the fields?
* I'm aware of how a columnar database works, but wasn't aware of the huge price of bringing back lots of fields based on a very specific WHERE clause.
The following link provides a very clear answer:
best-practices-performance-input
BigQuery does not have a concept of an index or anything like that. When you query a column, BigQuery scans through all the values of that column and then performs the operations you want (for a deeper understanding, they have some pretty cool posts about the inner workings of BQ).
That means that when you select x and y where x = 1, BQ reads through all the values of x and y and then finds where x = 1. This is consistent with the numbers above: an INT64 column stores 8 bytes per row, so a few million rows comes to roughly 32 MB per column, and referencing two columns roughly doubles that.
This ends up being an amazing feature of BQ: you just load your data there and it just works. It does force you to be aware of how much data each query reads, though. Queries of the type SELECT * FROM table should be used only if you really need all the columns.
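For illustration, reusing the table from the question: on a plain (non-partitioned, non-clustered) table, both of the queries below should show the same estimated bytes processed, because the estimate depends only on which columns are referenced, not on the WHERE clause or the LIMIT:
SELECT x FROM [myproject:Test.Test]
SELECT x FROM [myproject:Test.Test] WHERE x = 1 LIMIT 50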

SQLite Query - Need to get multiple values from multiple keys

I have a database as follows.
security_id ticker company_name
----------------------------------------------
100019 PANL UNIVERSAL DISPLAY CORP
10001 NAFC NASH FINCH CO
100030 PRVT PRIVATE MEDIA GROUP INC
100033 REFR RESEARCH FRONTIER INC
I have a list of ticker symbols like [GOOG, NAFC, AAPL, PRVT] and I want to get a list of security_id's that are associated with these ticker symbols I have in the list.
I'm new to SQL, so at first I thought of obtaining them one by one by iteration. This works but it's really slow, so I was wondering if there is a SQL statement that can help me.
For SQL Server it would be something similar to this:
select security_id,ticker from <your table name>
where ticker in ('GOOG', 'NAFC', 'AAPL', 'PRVT')
The IN clause takes a list of strings to compare against the ticker column. This would just be used if you were executing the T-SQL in SQL Server Management Studio. If you were to break this out into a stored procedure, you would have to pass the tickers as CSV and then create a function to split the CSV into a temp table to compare against (a sketch of that is shown below).
Updated to include the ticker in return to know which security_id belongs to which ticker.
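A rough sketch of that CSV-split idea, assuming SQL Server 2016 or later, where the built-in STRING_SPLIT function is available (on older versions you would write your own split function), and keeping the placeholder table name from above:
-- @tickers would be the stored procedure's CSV parameter
DECLARE @tickers VARCHAR(MAX) = 'GOOG,NAFC,AAPL,PRVT';

SELECT t.security_id, t.ticker
FROM <your table name> t
JOIN STRING_SPLIT(@tickers, ',') s
    ON t.ticker = s.value;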

SQL Server 2008 Query Result Formatting (changing x and y axis fields)

What is the most efficient way to format query results, whether in the actual SQL Server code or using a different program such as Access or Excel, so that the X axis (first-row column headers) and Y axis (first-column field values) can be changed, while the same query result data is still represented, just in a different layout?
The way the data is stored in my database, and the way my original query results are returned in SQL Server 2008, is as follows:
[screenshot: original results format]
And below is the way I need the data to look:
[screenshot: how I need the results to look]
In essence, I need the Zipcode field to go down the Y axis (first column) and the Coveragecode field to go across the top row (X axis), with the exposures filling in the rest of the data.
The only way I can think of getting this done is by bringing the data into Excel and doing a bunch of VLOOKUPs. I tried using some pivot tables but didn't get too far. I'm going to keep trying to format the data with VLOOKUPs, but hopefully someone can come up with a better way of doing this.
What you are looking for is a table operator called PIVOT. Docs: http://msdn.microsoft.com/en-us/library/ms177410.aspx
WITH Src AS (
    SELECT Coveragecode, Zipcode, EarnedExposers
    FROM yourTable
)
SELECT Zipcode, [BI], [PD], [PIP], [UMBI], [COMP], [COLL]
FROM Src
PIVOT (
    MAX(EarnedExposers) FOR CoverageCode
    IN ([BI], [PD], [PIP], [UMBI], [COMP], [COLL])
) AS P;