I'm looking for a way to code a numpy select statement that does the same as an SQL case statement. I have a dataframe named df1 with the following columns:
up1, up2, sc1, sc2, st1, st2
My SQL script would look like:
CASE sc1
    when "UP_MJB" then st1
    when "UP_MSCI" then st2
    else ""
END
How do I code it using np.select? Any help would be greatly appreciated.
Let's assume df is the pandas DataFrame that holds your data:
import numpy as np
conds = [df['sc1'] == 'UP_MJB', df['sc1'] == 'UP_MSCI']
actions = [df['st1'], df['st2']]
df['new_col'] = np.select(conds, actions, default=df['sc1'])
The default parameter is used when none of the conditions is satisfied. In this example it retains the value of column 'sc1'; pass default='' instead to match the else "" branch of the SQL.
See https://numpy.org/doc/stable/reference/generated/numpy.select.html for more info.
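As a minimal, self-contained sketch of the same idea, the snippet below runs np.select on a few made-up rows (the sample values are assumptions; only the column names come from the question) and uses default="" so the result mirrors the SQL's else "" branch:

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; only the column names come from the question.
df1 = pd.DataFrame({
    "sc1": ["UP_MJB", "UP_MSCI", "OTHER"],
    "st1": ["a1", "a2", "a3"],
    "st2": ["b1", "b2", "b3"],
})

# One boolean mask per WHEN branch, evaluated in order.
conds = [df1["sc1"] == "UP_MJB", df1["sc1"] == "UP_MSCI"]
# The value to take for each branch, in the same order as conds.
choices = [df1["st1"], df1["st2"]]

# default="" plays the role of ELSE "".
df1["new_col"] = np.select(conds, choices, default="")
print(df1["new_col"].tolist())
```

When several conditions are true for a row, np.select takes the first matching one, which is also how a SQL CASE expression behaves.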
I want to add a new column (name: conc) to a dataframe "table", using the values in its columns (plate, ab) to look up the numeric value in the dataframe "concs".
Below is what I mean, with the dataframe "exp" used to show what I expect the data to look like
What is the proper way to do this? Is it done with some multiple condition, or do I need to reshape the concs dataframe somehow?
Use DataFrame.melt with a left join to build the new column concs; where there is no match, NaNs are created:
exp = concs.melt('plate', var_name='ab', value_name='concs').merge(table, on=['plate', 'ab'], how='left')
The solution can be simplified: if 'plate' and 'ab' are the only column names shared by both DataFrames and you need to merge on both, you can omit the on parameter:
exp = concs.melt('plate', var_name='ab', value_name='concs').merge(table, how='left')
First melt the concs dataframe and then merge with table:
out = (concs.melt(id_vars=['plate'],
                  value_vars=concs.columns.drop('plate').tolist(),
                  var_name='ab')
            .merge(table, on=['plate', 'ab'])
            .rename(columns={'value': 'concs'}))
or just make good use of the parameters of melt, as in jezrael's answer:
out = (concs.melt(id_vars=['plate'],
                  value_name='concs',
                  var_name='ab')
            .merge(table, on=['plate', 'ab']))
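Since the question's tables are not shown, here is a small sketch of the melt-then-merge idea on invented data (the values and the column names ab_a and ab_b are assumptions). It merges from the table side with a left join, a variation that keeps every row of table and fills unmatched (plate, ab) pairs with NaN:

```python
import pandas as pd

# Hypothetical data: the question's actual tables are not shown.
concs = pd.DataFrame({
    "plate": [1, 2],
    "ab_a": [0.5, 1.0],
    "ab_b": [2.0, 4.0],
})
table = pd.DataFrame({
    "plate": [1, 1, 2, 3],
    "ab": ["ab_a", "ab_b", "ab_b", "ab_a"],
})

# Wide -> long: one row per (plate, ab) pair with its concentration.
long_concs = concs.melt(id_vars="plate", var_name="ab", value_name="conc")

# Left join keeps every row of table; unmatched pairs get NaN in conc.
result = table.merge(long_concs, on=["plate", "ab"], how="left")
print(result)
```

Plate 3 has no entry in concs here, so its conc value comes out as NaN rather than the row being dropped.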
I have recently been asked to do a count of all the cells in some tables that are not NULL and not empty/blank.
The issue is, I have about 80 tables and some of those tables have dozens of columns and others have hundreds of columns.
Is there a query I could use to count all cells from all columns that fit specific criteria (in this case not NULL and not empty/blank)?
I have done some searching and it seems most answers revolve around single columns or tables that only have like 3-5 columns.
Thanks!
Try connecting to the SQL database from pandas using the pymysql or pyodbc connector, then iterate over each column with a for loop and apply the count function to it.
import pymysql
import pandas as pd
# Replace the bracketed placeholders with your own connection details.
con = pymysql.connect(host='[host name]', user='[user name]',
                      password='[your password]', database='[database name]')
df = pd.read_sql('select * from [table name]', con)  # load the SQL table into a pandas DataFrame
print(df)
for col in df.columns:  # loop through the columns
    count_ = df[col].count()  # number of non-NaN values in this column
    print(col, count_)  # note: empty strings are still counted here
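The loop above counts non-null values but still includes empty strings. A possible extension, sketched below, walks over every table and excludes blank strings as well. It uses an in-memory SQLite database purely as a stand-in so the example is self-contained; the tables t1 and t2, their contents, and the helper count_filled_cells are all hypothetical, and the query that lists tables differs per database (MySQL, for instance, exposes information_schema.tables).

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for the real server (assumption).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t1 (a TEXT, b TEXT)")
con.executemany("INSERT INTO t1 VALUES (?, ?)",
                [("x", ""), (None, "y"), ("  ", "z")])
con.execute("CREATE TABLE t2 (c INTEGER)")
con.executemany("INSERT INTO t2 VALUES (?)", [(1,), (None,)])

def count_filled_cells(df):
    """Count cells that are neither NULL nor empty/whitespace-only strings."""
    total = 0
    for col in df.columns:
        s = df[col]
        filled = s.notna()
        # Only string values can be blank; other types count if non-null.
        is_blank = s.map(
            lambda v: isinstance(v, str) and v.strip() == ""
        ).astype(bool)
        total += int((filled & ~is_blank).sum())
    return total

# List the tables, then count the filled cells of each one.
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
counts = {}
for name in tables:
    df = pd.read_sql(f'SELECT * FROM "{name}"', con)
    counts[name] = count_filled_cells(df)
print(counts)
```

This loads each table fully into memory, which is the simplest approach but may not suit very large tables.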
I have a relatively large table with thousands of rows and a few tens of columns. Some columns are metadata and others are numerical values. The problem is that some metadata columns are incomplete or partial: the string after the ":" is missing. I want to get a count of how many entries are missing the part after the colon.
If you look at the miniature example below, what I should get is a small table telling me that in group A, MetaData is complete for 2 entries and incomplete (missing after ":") for the other 2 entries. Ideally I also want to get some statistics on SomeValue (count, max, min, etc.).
How do I do it in an SQL query or in Python pandas?
It might turn out to be simple with some built-in function; however, I am not getting it right.
Data:
Group MetaData SomeValue
A AB:xxx 20
A AB: 5
A PQ:yyy 30
A PQ: 2
Expected Output result:
Group MetaDataComplete Count
A Yes 2
A No 2
There's no reason to use split functions (unless the value can contain a colon character). I'm just going to assume that the "null" values (not technically the right word) end with a colon.
select
    "Group",
    case when MetaData like '%:' then 'No' else 'Yes' end as MetaDataComplete,
    count(*) as "Count"
from T
group by "Group", case when MetaData like '%:' then 'No' else 'Yes' end
You could also use right(MetaData, 1) = ':'.
Or supposing that values can contain their own colons, try charindex(':', MetaData) = len(MetaData) if you just want to ask whether the first colon is in the last position.
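For the pandas side of the question, a rough equivalent of the query above might look like the sketch below. It rebuilds the question's sample data, flags values ending in ":" as incomplete, and also computes the count, min and max of SomeValue that the question asks about; the name summary is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Group": ["A", "A", "A", "A"],
    "MetaData": ["AB:xxx", "AB:", "PQ:yyy", "PQ:"],
    "SomeValue": [20, 5, 30, 2],
})

# A value is complete when something follows the colon,
# i.e. when it does not end with ":".
df["MetaDataComplete"] = (
    df["MetaData"].str.endswith(":").map({True: "No", False: "Yes"})
)

# Count plus basic statistics of SomeValue per group and completeness flag.
summary = (
    df.groupby(["Group", "MetaDataComplete"])["SomeValue"]
      .agg(["count", "min", "max"])
      .reset_index()
)
print(summary)
```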
Here is an example:
## 1- Create DataFrame
In [1]:
import pandas as pd
import numpy as np
cols = ['Group', 'MetaData', 'SomeValue']
data = [['A', 'AB:xxx', 20],
        ['A', 'AB:', 5],
        ['A', 'PQ:yyy', 30],
        ['A', 'PQ:', 2]]
df = pd.DataFrame(columns=cols, data=data)
## 2- New columns holding the two parts of the split value
new = df["MetaData"].str.split(":", n=1, expand=True)
df["MetaData_1"] = new[0]
df["MetaData_2"] = new[1]
## 3- Drop the old MetaData column
df.drop(columns=["MetaData"], inplace=True)
## 4- Replace empty strings with NaN and count them
df.replace('', np.nan, inplace=True)
df.isnull().sum()
Out [1]:
Group 0
SomeValue 0
MetaData_1 0
MetaData_2 2
dtype: int64
From a SQL perspective, performing a split is painful, not to mention that using the split results means running the split as a subquery first and then querying its results:
SELECT
    Results.[Group],
    Results.MetaData,
    Results.MetaValue,
    COUNT(Results.MetaValue)
FROM (SELECT
        [Group],
        MetaData,
        SUBSTRING(MetaData, CHARINDEX(':', MetaData) + 1, LEN(MetaData)) AS MetaValue
      FROM VeryLargeTable) AS Results
GROUP BY Results.[Group],
    Results.MetaData,
    Results.MetaValue
If you're just after a count, you could also try the algorithmic approach: loop over the data and use a regular expression with a negative lookahead.
import pandas as pd
import re
# data is assumed to be the DataFrame holding the table above
pattern = r'.*:(?!.)'  # matches strings where nothing follows the colon
missing = 0
not_missing = 0
for i in data['MetaData'].tolist():
    match = re.findall(pattern, i)
    if match:
        missing += 1
    else:
        not_missing += 1
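If the data already lives in a DataFrame, the same regex test can probably be applied without an explicit loop. The sketch below uses Series.str.contains with an end-of-string anchor on the question's sample values; the DataFrame name df is an assumption.

```python
import pandas as pd

# Sample MetaData values copied from the question.
df = pd.DataFrame({"MetaData": ["AB:xxx", "AB:", "PQ:yyy", "PQ:"]})

# True where the value ends with a colon; na=False makes missing
# values count as non-matching instead of propagating NaN.
ends_with_colon = df["MetaData"].str.contains(r":$", na=False)

missing = int(ends_with_colon.sum())
not_missing = int((~ends_with_colon).sum())
print(missing, not_missing)
```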
In a pandas DataFrame/Series there's an .isnull() method. Is there something similar in the syntax of the where= filter of the select method of HDFStore?
WORKAROUND SOLUTION:
The /meta section of a data column inside hdf5 can be used as a hack solution:
import pandas as pd
store = pd.HDFStore('store.h5')
print(store.groups())  # groups is a method, so it must be called
non_null = list(store.select("/df/meta/my_data_column/meta"))
df = store.select('df', where='my_data_column == non_null')
I am attempting to dynamically create a new column based on the values of another column.
Say I have the following dataframe
A|B
11|1
22|0
33|1
44|1
55|0
I want to create a new column.
If the value of column B is 1, insert 'Y' else insert 'N'.
The resulting dataframe should look like this:
A|B|C
11|1|Y
22|0|N
33|1|Y
44|1|Y
55|0|N
I could do this by iterating through the column values:
values = []
for i in dataframe['B'].values:
    if i == 1:
        values.append('Y')  # add Y to the new column's values
    else:
        values.append('N')  # add N to the new column's values
dataframe['C'] = values
However, I am afraid this will severely reduce performance, especially since my dataset contains 500,000+ rows.
Any help will be greatly appreciated.
Thank you.
Avoid chained indexing by using loc. There are some subtleties with returning a view versus a copy in pandas that are related to numpy.
df['C'] = 'N'
df.loc[df.B == 1, 'C'] = 'Y'
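Try this:
df['C'] = 'N'
df['C'][df['B']==1] = 'Y'
This should be faster, but note that it relies on chained assignment, which can trigger SettingWithCopyWarning and does not update df at all when pandas Copy-on-Write is enabled. The loc form is the safer choice.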
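Another common vectorized option, not taken from the answers above, is np.where, which builds the whole column in one call and avoids both the Python loop and chained assignment. A minimal sketch on the question's sample data:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"A": [11, 22, 33, 44, 55], "B": [1, 0, 1, 1, 0]})

# 'Y' where B equals 1, 'N' everywhere else.
df["C"] = np.where(df["B"] == 1, "Y", "N")
print(df)
```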