Reading multiple datasets from a single file - pandas

My text file contains a table for each database. Is there any way pandas can read this file and create a separate dataframe for each database?
Database: ABC
+-----------------------------------------------+----------+------------+
| Tables | Columns | Total Rows |
+-----------------------------------------------+----------+------------+
| ApplicationUpdateBankLog | 13 | 0 |
| ChangeLogTemp | 12 | 1678363 |
| Sheet2$ | 10 | 359 |
| tempAllowApplications | 1 | 9 |
+-----------------------------------------------+----------+------------+
4 rows in set.
Database: XYZ
+--------------------------------------------------+----------+------------+
| Tables | Columns | Total Rows |
+--------------------------------------------------+----------+------------+
| BKP_QualificationDetails_12082014 | 14 | 7959877 |
| BillNotGeneratedCount | 11 | 2312 |
| VVshipBenefit | 19 | 197356 |
| VVBenefit_Bkup29012016 | 19 | 101318 |
+--------------------------------------------------+----------+------------+
4 rows in set.

You can use a dict comprehension for creating a dict of DataFrames:
import pandas as pd
from io import StringIO
temp=u"""Database: ABC
+-----------------------------------------------+----------+------------+
| Tables | Columns | Total Rows |
+-----------------------------------------------+----------+------------+
| ApplicationUpdateBankLog | 13 | 0 |
| ChangeLogTemp | 12 | 1678363 |
| Sheet2$ | 10 | 359 |
| tempAllowApplications | 1 | 9 |
+-----------------------------------------------+----------+------------+
4 rows in set.
Database: XYZ
+--------------------------------------------------+----------+------------+
| Tables | Columns | Total Rows |
+--------------------------------------------------+----------+------------+
| BKP_QualificationDetails_12082014 | 14 | 7959877 |
| BillNotGeneratedCount | 11 | 2312 |
| VVshipBenefit | 19 | 197356 |
| VVBenefit_Bkup29012016 | 19 | 101318 |
+--------------------------------------------------+----------+------------+
4 rows in set."""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="|", names=['a', 'Tables', 'Columns', 'Total Rows'])
#keep the 'Database: ...' labels in column a and forward fill them over the NaN rows below
df.a = df.a.where(df.a.str.startswith('Database')).ffill()
#remove rows where NaN in Tables column
df = df.dropna(subset=['Tables'])
#strip whitespace from all values, set index for selecting in the dict comprehension
df = df.apply(lambda x: x.str.strip()).set_index('a')
#convert to numeric columns, replace NaN, convert to int
df['Columns'] = pd.to_numeric(df['Columns'], errors='coerce').fillna(0).astype(int)
df['Total Rows'] = pd.to_numeric(df['Total Rows'], errors='coerce').fillna(0).astype(int)
#remove the repeated header rows (value 'Tables')
df = df[df['Tables'] != 'Tables']
print (df)
                                          Tables  Columns  Total Rows
a
Database: ABC           ApplicationUpdateBankLog       13           0
Database: ABC                      ChangeLogTemp       12     1678363
Database: ABC                            Sheet2$       10         359
Database: ABC              tempAllowApplications        1           9
Database: XYZ  BKP_QualificationDetails_12082014       14     7959877
Database: XYZ              BillNotGeneratedCount       11        2312
Database: XYZ                      VVshipBenefit       19      197356
Database: XYZ             VVBenefit_Bkup29012016       19      101318
#select in dict comprehension and reset index to default monotonic index
dfs = {x:df.loc[x].reset_index(drop=True) for x in df.index.unique()}
print (dfs['Database: ABC'])
                     Tables  Columns  Total Rows
0  ApplicationUpdateBankLog       13           0
1             ChangeLogTemp       12     1678363
2                   Sheet2$       10         359
3     tempAllowApplications        1           9
print (dfs['Database: XYZ'])
                               Tables  Columns  Total Rows
0  BKP_QualificationDetails_12082014       14     7959877
1               BillNotGeneratedCount       11        2312
2                       VVshipBenefit       19      197356
3              VVBenefit_Bkup29012016       19      101318
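If you would rather not push the report through read_csv at all, a minimal alternative sketch is to split the text on the "Database:" headers and parse each block directly (read_db_tables and the path argument are hypothetical names, not from the question):
import re
import pandas as pd

def read_db_tables(path):
    with open(path) as fh:
        text = fh.read()
    dfs = {}
    #every block starts with a 'Database: <name>' line
    for block in re.split(r'(?m)^(?=Database:)', text):
        if not block.startswith('Database:'):
            continue
        lines = block.splitlines()
        name = lines[0].strip()
        #keep only the '|'-delimited data rows, skipping the header row
        rows = [[cell.strip() for cell in line.strip().strip('|').split('|')]
                for line in lines[1:]
                if line.lstrip().startswith('|') and not line.lstrip().startswith('| Tables')]
        df = pd.DataFrame(rows, columns=['Tables', 'Columns', 'Total Rows'])
        dfs[name] = df.astype({'Columns': int, 'Total Rows': int})
    return dfs

#dfs = read_db_tables('filename.txt')
#print(dfs['Database: ABC'])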

Related

Generate new values in dataframe based on other data

I am trying to calculate an additional column in a results dataframe based on a filter operation from another dataframe that does not match in size.
So, I have my source dataframe source_df:
| id | date       |
|----|------------|
| 1  | 2100-01-01 |
| 2  | 2021-12-12 |
| 3  | 2018-09-01 |
| 4  | 2100-01-01 |
and the target dataframe target_df. The dataframe lengths and the number of ids do not necessarily match:
| id  |
|-----|
| 1   |
| 2   |
| 3   |
| 4   |
| 5   |
| ... |
| 100 |
I actually want to find out which dates lie more than 30 days in the past.
To do so, I created a query
query = (pd.Timestamp.today() - pd.to_datetime(source_df["date"], errors="coerce")).dt.days > 30
ids = source_df[query]["id"]
--> ids = [2,3]
My intention is to calculate a column "date_in_past" that contains the values 0 and 1: if the date difference is greater than 30 days a 1 is inserted, otherwise a 0.
The target_df should look like:
| id | date_in_past |
|----|--------------|
| 1  | 0            |
| 2  | 1            |
| 3  | 1            |
| 4  | 0            |
| 5  | 0            |
Both indices and df lengths do not match.
I tried to create a lambda function
map_query = lambda x: 1 if x in ids.values else 0
When I try to pass the target frame via map_query(target_df["id"]), a ValueError is thrown: "lengths must match to compare".
How can I assign the new column "date_in_past" having the calculated values based on the source dataframe?
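One way around the length mismatch, sketched under the assumption that ids is computed as in the question, is Series.isin: it performs a per-element membership test, so the two frames do not need the same length or index.
import pandas as pd

source_df = pd.DataFrame({'id': [1, 2, 3, 4],
                          'date': ['2100-01-01', '2021-12-12', '2018-09-01', '2100-01-01']})
target_df = pd.DataFrame({'id': range(1, 101)})

#ids whose date lies more than 30 days in the past
days_old = (pd.Timestamp.today() - pd.to_datetime(source_df['date'], errors='coerce')).dt.days
ids = source_df.loc[days_old > 30, 'id']

#per-element membership test, so differing lengths and indices do not matter
target_df['date_in_past'] = target_df['id'].isin(ids).astype(int)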

Create new column in pandas depending on multiple conditions

I would like to create a new column based on various conditions
Let's say I have a df where column A can equal any of the following: ['Single', 'Multiple', 'Commercial', 'Domestic', 'Other'], column B has numeric values from 0-30.
I'm trying to get column C to be 'Moderate' if A = 'Single' or 'Multiple', and if it equals anything else, to consider the values in column B. If column A != 'Single' or 'Multiple', column C will equal Moderate if 3 < B > 19 and 'High' if B>=19.
I have tried various loop combinations but I can't seem to get it. Any help?
trial = []
for x in df['A']:
    if x == 'Single' or x == 'Multiple':
        trial.append('Moderate')
    elif x != 'Single' or x != 'Multiple':
        if df['B'] > 19:
            trial.append('Test')
df['trials'] = trial
Thank you kindly,
Denisse
It would be good if you provided some sample data, but with some that I created you can see how to apply a function to each row of your DataFrame.
Data
import pandas as pd

valuesA = ['Single', 'Multiple', 'Commercial', 'Domestic', 'Other',
           'Single', 'Multiple', 'Commercial', 'Domestic', 'Other']
valuesB = [0, 10, 20, 25, 30, 25, 15, 10, 5, 3]
df = pd.DataFrame({'A': valuesA, 'B': valuesB})
| | A | B |
|---:|:-----------|----:|
| 0 | Single | 0 |
| 1 | Multiple | 10 |
| 2 | Commercial | 20 |
| 3 | Domestic | 25 |
| 4 | Other | 30 |
| 5 | Single | 25 |
| 6 | Multiple | 15 |
| 7 | Commercial | 10 |
| 8 | Domestic | 5 |
| 9 | Other | 3 |
Function to apply
You don't specify what happens if column B is less than or equal to 3, so I suppose that C will be 'Low'. Adapt the function as you need. Also, there is maybe a typo in your question where you say '3 < B > 19'; I changed it to '3 < B < 19'.
def my_function(x):
    if x['A'] in ['Single', 'Multiple']:
        return 'Moderate'
    else:
        if x['B'] <= 3:
            return 'Low'
        elif 3 < x['B'] < 19:
            return 'Moderate'
        else:
            return 'High'
New column
With the DataFrame and the new function, you can apply it to each row with the apply method, using the argument axis=1:
df['C'] = df.apply(my_function, axis=1)
| | A | B | C |
|---:|:-----------|----:|:---------|
| 0 | Single | 0 | Moderate |
| 1 | Multiple | 10 | Moderate |
| 2 | Commercial | 20 | High |
| 3 | Domestic | 25 | High |
| 4 | Other | 30 | High |
| 5 | Single | 25 | Moderate |
| 6 | Multiple | 15 | Moderate |
| 7 | Commercial | 10 | Moderate |
| 8 | Domestic | 5 | Moderate |
| 9 | Other | 3 | Low |
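If you later want to avoid the row-wise apply, the same logic can also be sketched in vectorized form with numpy.select, under the same threshold assumptions as the function above; np.select returns the choice for the first condition that matches, so the order of the conditions encodes the if/elif chain:
import numpy as np

conditions = [
    df['A'].isin(['Single', 'Multiple']),  #always 'Moderate'
    df['B'] <= 3,                          #otherwise 'Low' for small B
    df['B'] < 19,                          #otherwise 'Moderate' for 3 < B < 19
]
choices = ['Moderate', 'Low', 'Moderate']
df['C'] = np.select(conditions, choices, default='High')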

pandas pivot onto values

Given a dataframe
df=pd.DataFrame([[1,11,0],[1,12,1],[2,21,0],[2,22,1]])
df.columns = ['Key','Value','PivotOn']
pivoted = df.pivot(index='Key',columns='PivotOn',values='Value')
The pivot action will give me columns 0 and 1 from the column 'PivotOn'. But I would like to always pivot onto the values 0, 1 and 2, even if no row with PivotOn = 2 exists (just produce NaN for it).
I cannot modify original dataframe so I'd want something like:
pivoted = df.pivot(index='Key',columns=[0,1,2],values='Value')
where it will always produce 3 columns of 0, 1 and 2 and column 2 is filled with nans.
Assume PivotOn has three unique values 0, 1, 2
df=pd.DataFrame([[1,11,0],[1,12,1],[2,21,0],[2,22,2]])
df.columns = ['Key','Value','PivotOn']
df
+---+-----+-------+---------+
| | Key | Value | PivotOn |
+---+-----+-------+---------+
| 0 | 1 | 11 | 0 |
| 1 | 1 | 12 | 1 |
| 2 | 2 | 21 | 0 |
| 3 | 2 | 22 | 2 |
+---+-----+-------+---------+
And say you need to include columns 2, 3 and 4 (2 may or may not be present in the original df, so this generalizes). Then:
import numpy as np
import pandas as pd

expected = {2, 3, 4}
res = list(expected - set(df.PivotOn.unique()))
if res:
    #append placeholder rows so the missing PivotOn values show up as columns
    new_df = pd.DataFrame({'Key': np.nan, 'Value': np.nan, 'PivotOn': res},
                          index=range(df.shape[0], df.shape[0] + len(res)))
    ndf = pd.concat([df, new_df], sort=False)
    pivoted = ndf.pivot(index='Key', columns='PivotOn', values='Value').dropna(how='all')
else:
    pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
pivoted
+---------+------+------+------+-----+-----+
| PivotOn | 0 | 1 | 2 | 3 | 4 |
+---------+------+------+------+-----+-----+
| Key | | | | | |
| 1.0 | 11.0 | 12.0 | NaN | NaN | NaN |
| 2.0 | 21.0 | NaN | 22.0 | NaN | NaN |
+---------+------+------+------+-----+-----+
You might try this if all you need is a column 2 filled with NaN when it does not exist in your dataframe:
import numpy as np

def no_col_2(df):
    #check the column's values (membership on a Series tests the index, not the values)
    if 2 not in df['PivotOn'].values:
        pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
        pivoted[2] = np.nan
    else:
        pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
    return pivoted

pivoted = no_col_2(df)
print(pivoted)
PivotOn   0   1   2
Key
1        11  12 NaN
2        21  22 NaN
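If all you need is a fixed set of PivotOn columns regardless of which values actually occur, a simpler sketch is to pivot and then reindex the columns; reindex adds any missing column filled with NaN:
pivoted = df.pivot(index='Key', columns='PivotOn', values='Value').reindex(columns=[0, 1, 2])
print(pivoted)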

Displaying one value for multiple rows in a MultiIndexed dataframe

I'm interested in presenting the following data in pandas:
metric1 | metric2 || % occurence | total
-----------------------------------------
A       | 1       || 20          |
        | 2       || 10          |  35
        | 3       || 5           |
-----------------------------------------
B       | 1       || 40          |
        | 2       || 10          |  65
        | 3       || 15          |
(For text search, I'd describe this as presenting a breakdown of a groupby together with the aggregate values of the outer level of a MultiIndex)
I can create all the columns except for the total column: assuming df is a flat table like
metric1 | metric 2 | percentage
--------------------------------
A | 1 | 20
A | 2 | 10
A | 3 | 5
B | 1 | 40
B | 2 | 10
B | 3 | 15
I can get most of what I want using
aggregate_df = df.groupby(['metric1', 'metric2']).sum()
And I can get the total values using
aggregate_df.sum(level=0)
My question is, is there any way to display them together in a single DataFrame?
You can build it with crosstab plus stack, moving the margin column into the MultiIndex:
(pd.crosstab(index=df.metric1, columns=df.metric2, values=df.percentage,
             aggfunc='sum', margins=True)
   .set_index('All', append=True)
   .iloc[:-1]
   .stack())
Out[59]:
metric1  All  metric2
A        35   1          20
              2          10
              3           5
B        65   1          40
              2          10
              3          15
dtype: int64
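If you prefer to stay with the groupby the question starts from, a sketch of the same idea is to aggregate per (metric1, metric2) and then broadcast the per-metric1 total back onto every row with transform (assuming df is the flat table shown in the question; the result column names are only illustrative):
out = df.groupby(['metric1', 'metric2'])['percentage'].sum().to_frame('% occurence')
out['total'] = out.groupby(level='metric1')['% occurence'].transform('sum')
print(out)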

How to show records where 3 fields match

I'm trying to write a query that will list the rows in a table where 3 specific fields have the same values, without knowing those values in advance:
TABLE:
FIELD 1 | FIELD 2 | FIELD 3 | FIELD 4
---------|--------------|------------|---------------
1 | 01-01-15 | 21 | 150
1 | 01-01-15 | 24 | 12
1 | 02-01-15 | 21 | 681
1 | 01-01-15 | 21 | 299
DESIRED RESULTS:
FIELD 1 | FIELD 2 | FIELD 3 | FIELD 4
-------------|--------------|-------------|------------
1 | 01-01-15 | 21 | 150
1 | 01-01-15 | 21 | 299
Sorry - still a newb here! Thanks in advance!
Count the number of rows with the same combination and filter for a count > 1:
select *
from tab
qualify count(*) over (partition by field1, field2, field3) > 1
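For readers coming from the pandas questions above, the same filter can be sketched with a groupby transform, assuming the table has been loaded into a DataFrame df with columns field1, field2, field3 and field4:
import pandas as pd

#rows whose (field1, field2, field3) combination appears more than once
mask = df.groupby(['field1', 'field2', 'field3'])['field1'].transform('size') > 1
print(df[mask])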