Reading multiple datasets from a single file - pandas

My text file contains a table for each database. Is there any way pandas can read this file and create a separate dataframe for each database?
Database: ABC
+-----------------------------------------------+----------+------------+
| Tables | Columns | Total Rows |
+-----------------------------------------------+----------+------------+
| ApplicationUpdateBankLog | 13 | 0 |
| ChangeLogTemp | 12 | 1678363 |
| Sheet2$ | 10 | 359 |
| tempAllowApplications | 1 | 9 |
+-----------------------------------------------+----------+------------+
4 rows in set.
Database: XYZ
+--------------------------------------------------+----------+------------+
| Tables | Columns | Total Rows |
+--------------------------------------------------+----------+------------+
| BKP_QualificationDetails_12082014 | 14 | 7959877 |
| BillNotGeneratedCount | 11 | 2312 |
| VVshipBenefit | 19 | 197356 |
| VVBenefit_Bkup29012016 | 19 | 101318 |
+--------------------------------------------------+----------+------------+
4 rows in set.

You can use a dict comprehension for creating a dict of DataFrames:
import pandas as pd
from io import StringIO
temp=u"""Database: ABC
+-----------------------------------------------+----------+------------+
| Tables | Columns | Total Rows |
+-----------------------------------------------+----------+------------+
| ApplicationUpdateBankLog | 13 | 0 |
| ChangeLogTemp | 12 | 1678363 |
| Sheet2$ | 10 | 359 |
| tempAllowApplications | 1 | 9 |
+-----------------------------------------------+----------+------------+
4 rows in set.
Database: XYZ
+--------------------------------------------------+----------+------------+
| Tables | Columns | Total Rows |
+--------------------------------------------------+----------+------------+
| BKP_QualificationDetails_12082014 | 14 | 7959877 |
| BillNotGeneratedCount | 11 | 2312 |
| VVshipBenefit | 19 | 197356 |
| VVBenefit_Bkup29012016 | 19 | 101318 |
+--------------------------------------------------+----------+------------+
4 rows in set."""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="|", names=['a', 'Tables', 'Columns', 'Total Rows'])
#keep the 'Database: ...' labels in column a and forward fill them over the NaN rows below
df.a = df.a.where(df.a.str.startswith('Database')).ffill()
#remove rows where NaN in Tables column
df = df.dropna(subset=['Tables'])
#strip whitespace from all values, set index for selecting in the dict comprehension
df = df.apply(lambda x: x.str.strip()).set_index('a')
#convert to numeric columns, replace NaN, convert to int
df['Columns'] = pd.to_numeric(df['Columns'], errors='coerce').fillna(0).astype(int)
df['Total Rows'] = pd.to_numeric(df['Total Rows'], errors='coerce').fillna(0).astype(int)
#remove the repeated header rows (value 'Tables')
df = df[df['Tables'] != 'Tables']
print (df)
                                          Tables  Columns  Total Rows
a
Database: ABC           ApplicationUpdateBankLog       13           0
Database: ABC                      ChangeLogTemp       12     1678363
Database: ABC                            Sheet2$       10         359
Database: ABC              tempAllowApplications        1           9
Database: XYZ  BKP_QualificationDetails_12082014       14     7959877
Database: XYZ              BillNotGeneratedCount       11        2312
Database: XYZ                      VVshipBenefit       19      197356
Database: XYZ             VVBenefit_Bkup29012016       19      101318
#select in dict comprehension and reset index to default monotonic index
dfs = {x:df.loc[x].reset_index(drop=True) for x in df.index.unique()}
print (dfs['Database: ABC'])
                     Tables  Columns  Total Rows
0  ApplicationUpdateBankLog       13           0
1             ChangeLogTemp       12     1678363
2                   Sheet2$       10         359
3     tempAllowApplications        1           9
print (dfs['Database: XYZ'])
                               Tables  Columns  Total Rows
0  BKP_QualificationDetails_12082014       14     7959877
1               BillNotGeneratedCount       11        2312
2                       VVshipBenefit       19      197356
3              VVBenefit_Bkup29012016       19      101318
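If you would rather not push the report through read_csv at all, a minimal alternative sketch is to split the text on the "Database:" headers and parse each block directly (read_db_tables and the path argument are hypothetical names, not from the question):
import re
import pandas as pd

def read_db_tables(path):
    with open(path) as fh:
        text = fh.read()
    dfs = {}
    #every block starts with a 'Database: <name>' line
    for block in re.split(r'(?m)^(?=Database:)', text):
        if not block.startswith('Database:'):
            continue
        lines = block.splitlines()
        name = lines[0].strip()
        #keep only the '|'-delimited data rows, skipping the header row
        rows = [[cell.strip() for cell in line.strip().strip('|').split('|')]
                for line in lines[1:]
                if line.lstrip().startswith('|') and not line.lstrip().startswith('| Tables')]
        df = pd.DataFrame(rows, columns=['Tables', 'Columns', 'Total Rows'])
        dfs[name] = df.astype({'Columns': int, 'Total Rows': int})
    return dfs

#dfs = read_db_tables('filename.txt')
#print(dfs['Database: ABC'])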

Related

Generate new values in dataframe based on other data

I am trying to calculate an additional column in a results dataframe based on a filter operation from another dataframe that does not match in size.
So, I have my source dataframe source_df:
| id | date       |
|----|------------|
| 1  | 2100-01-01 |
| 2  | 2021-12-12 |
| 3  | 2018-09-01 |
| 4  | 2100-01-01 |
and the target dataframe target_df. The dataframe lengths and the number of ids do not necessarily match:
| id  |
|-----|
| 1   |
| 2   |
| 3   |
| 4   |
| 5   |
| ... |
| 100 |
I actually want to find out which dates lie more than 30 days in the past.
To do so, I created a query
query = (pd.Timestamp.today() - pd.to_datetime(source_df["date"], errors="coerce")).dt.days > 30
ids = source_df[query]["id"]
--> ids = [2,3]
My intention is to calculate a column "date_in_past" that contains the values 0 and 1: if the date difference is greater than 30 days a 1 is inserted, otherwise a 0.
The target_df should look like:
| id | date_in_past |
|----|--------------|
| 1  | 0            |
| 2  | 1            |
| 3  | 1            |
| 4  | 0            |
| 5  | 0            |
Both indices and df lengths do not match.
I tried to create a lambda function
map_query = lambda x: 1 if x in ids.values else 0
When I try to pass the target frame via map_query(target_df["id"]), a ValueError is thrown: "lengths must match to compare".
How can I assign the new column "date_in_past" having the calculated values based on the source dataframe?
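One way around the length mismatch, sketched under the assumption that ids is computed as in the question, is Series.isin: it performs a per-element membership test, so the two frames do not need the same length or index.
import pandas as pd

source_df = pd.DataFrame({'id': [1, 2, 3, 4],
                          'date': ['2100-01-01', '2021-12-12', '2018-09-01', '2100-01-01']})
target_df = pd.DataFrame({'id': range(1, 101)})

#ids whose date lies more than 30 days in the past
days_old = (pd.Timestamp.today() - pd.to_datetime(source_df['date'], errors='coerce')).dt.days
ids = source_df.loc[days_old > 30, 'id']

#per-element membership test, so differing lengths and indices do not matter
target_df['date_in_past'] = target_df['id'].isin(ids).astype(int)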

Create new column in pandas depending on multiple conditions

I would like to create a new column based on various conditions
Let's say I have a df where column A can equal any of the following: ['Single', 'Multiple', 'Commercial', 'Domestic', 'Other'], column B has numeric values from 0-30.
I'm trying to get column C to be 'Moderate' if A = 'Single' or 'Multiple', and if it equals anything else, to consider the values in column B. If column A != 'Single' or 'Multiple', column C will equal Moderate if 3 < B > 19 and 'High' if B>=19.
I have tried various loop combinations but I can't seem to get it. Any help?
trial = []
for x in df['A']:
    if x == 'Single' or x == 'Multiple':
        trial.append('Moderate')
    elif x != 'Single' or x != 'Multiple':
        if df['B'] > 19:
            trial.append('Test')
df['trials'] = trial
Thank you kindly,
Denisse
It would be good if you provided some sample data, but with some that I created you can see how to apply a function to each row of your DataFrame.
Data
import pandas as pd

valuesA = ['Single', 'Multiple', 'Commercial', 'Domestic', 'Other',
           'Single', 'Multiple', 'Commercial', 'Domestic', 'Other']
valuesB = [0, 10, 20, 25, 30, 25, 15, 10, 5, 3]
df = pd.DataFrame({'A': valuesA, 'B': valuesB})
| | A | B |
|---:|:-----------|----:|
| 0 | Single | 0 |
| 1 | Multiple | 10 |
| 2 | Commercial | 20 |
| 3 | Domestic | 25 |
| 4 | Other | 30 |
| 5 | Single | 25 |
| 6 | Multiple | 15 |
| 7 | Commercial | 10 |
| 8 | Domestic | 5 |
| 9 | Other | 3 |
Function to apply
You don't specify what happens if column B is less than or equal to 3, so I suppose that C will be 'Low'. Adapt the function as you need. Also, there is maybe a typo in your question where you say '3 < B > 19'; I changed it to '3 < B < 19'.
def my_function(x):
    if x['A'] in ['Single', 'Multiple']:
        return 'Moderate'
    else:
        if x['B'] <= 3:
            return 'Low'
        elif 3 < x['B'] < 19:
            return 'Moderate'
        else:
            return 'High'
New column
With the DataFrame and the new function, you can apply it to each row with the apply method, using the argument axis=1:
df['C'] = df.apply(my_function, axis=1)
| | A | B | C |
|---:|:-----------|----:|:---------|
| 0 | Single | 0 | Moderate |
| 1 | Multiple | 10 | Moderate |
| 2 | Commercial | 20 | High |
| 3 | Domestic | 25 | High |
| 4 | Other | 30 | High |
| 5 | Single | 25 | Moderate |
| 6 | Multiple | 15 | Moderate |
| 7 | Commercial | 10 | Moderate |
| 8 | Domestic | 5 | Moderate |
| 9 | Other | 3 | Low |
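If you later want to avoid the row-wise apply, the same logic can also be sketched in vectorized form with numpy.select, under the same threshold assumptions as the function above; np.select returns the choice for the first condition that matches, so the order of the conditions encodes the if/elif chain:
import numpy as np

conditions = [
    df['A'].isin(['Single', 'Multiple']),  #always 'Moderate'
    df['B'] <= 3,                          #otherwise 'Low' for small B
    df['B'] < 19,                          #otherwise 'Moderate' for 3 < B < 19
]
choices = ['Moderate', 'Low', 'Moderate']
df['C'] = np.select(conditions, choices, default='High')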

pandas pivot onto values

Given a dataframe
df=pd.DataFrame([[1,11,0],[1,12,1],[2,21,0],[2,22,1]])
df.columns = ['Key','Value','PivotOn']
pivoted = df.pivot(index='Key',columns='PivotOn',values='Value')
The pivot action will give me columns 0 and 1 from the column 'PivotOn'. But I would like to always pivot onto the values 0, 1 and 2, even if no row with PivotOn = 2 exists (just produce NaN for it).
I cannot modify original dataframe so I'd want something like:
pivoted = df.pivot(index='Key',columns=[0,1,2],values='Value')
where it will always produce 3 columns of 0, 1 and 2 and column 2 is filled with nans.
Assume PivotOn has three unique values 0, 1, 2
df=pd.DataFrame([[1,11,0],[1,12,1],[2,21,0],[2,22,2]])
df.columns = ['Key','Value','PivotOn']
df
+---+-----+-------+---------+
| | Key | Value | PivotOn |
+---+-----+-------+---------+
| 0 | 1 | 11 | 0 |
| 1 | 1 | 12 | 1 |
| 2 | 2 | 21 | 0 |
| 3 | 2 | 22 | 2 |
+---+-----+-------+---------+
And say you need to include columns 2, 3 and 4 (2 may or may not be present in the original df, so this generalizes). Then:
import numpy as np
import pandas as pd

expected = {2, 3, 4}
res = list(expected - set(df.PivotOn.unique()))
if res:
    #append placeholder rows so the missing PivotOn values show up as columns
    new_df = pd.DataFrame({'Key': np.nan, 'Value': np.nan, 'PivotOn': res},
                          index=range(df.shape[0], df.shape[0] + len(res)))
    ndf = pd.concat([df, new_df], sort=False)
    pivoted = ndf.pivot(index='Key', columns='PivotOn', values='Value').dropna(how='all')
else:
    pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
pivoted
+---------+------+------+------+-----+-----+
| PivotOn | 0 | 1 | 2 | 3 | 4 |
+---------+------+------+------+-----+-----+
| Key | | | | | |
| 1.0 | 11.0 | 12.0 | NaN | NaN | NaN |
| 2.0 | 21.0 | NaN | 22.0 | NaN | NaN |
+---------+------+------+------+-----+-----+
You might try this if all you need is a column 2 filled with NaN when it does not exist in your dataframe:
import numpy as np

def no_col_2(df):
    #check the column's values (membership on a Series tests the index, not the values)
    if 2 not in df['PivotOn'].values:
        pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
        pivoted[2] = np.nan
    else:
        pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
    return pivoted

pivoted = no_col_2(df)
print(pivoted)
PivotOn   0   1   2
Key
1        11  12 NaN
2        21  22 NaN
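If all you need is a fixed set of PivotOn columns regardless of which values actually occur, a simpler sketch is to pivot and then reindex the columns; reindex adds any missing column filled with NaN:
pivoted = df.pivot(index='Key', columns='PivotOn', values='Value').reindex(columns=[0, 1, 2])
print(pivoted)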

Displaying one value for multiple rows in a MultiIndexed dataframe

I'm interested in presenting the following data in pandas:
metric1 | metric2 || % occurence | total
-----------------------------------------
A       | 1       || 20          |
        | 2       || 10          |  35
        | 3       || 5           |
-----------------------------------------
B       | 1       || 40          |
        | 2       || 10          |  65
        | 3       || 15          |
(For text search, I'd describe this as presenting a breakdown of a groupby together with the aggregate values of the outer level of a MultiIndex)
I can create all the columns except for the total column: assuming df is a flat table like
metric1 | metric 2 | percentage
--------------------------------
A | 1 | 20
A | 2 | 10
A | 3 | 5
B | 1 | 40
B | 2 | 10
B | 3 | 15
I can get most of what I want using
aggregate_df = df.groupby(['metric1', 'metric2']).sum()
And I can get the total values using
aggregate_df.sum(level=0)
My question is, is there any way to display them together in a single DataFrame?
You can build it with crosstab plus stack, moving the margin column into the MultiIndex:
(pd.crosstab(index=df.metric1, columns=df.metric2, values=df.percentage,
             aggfunc='sum', margins=True)
   .set_index('All', append=True)
   .iloc[:-1]
   .stack())
Out[59]:
metric1  All  metric2
A        35   1          20
              2          10
              3           5
B        65   1          40
              2          10
              3          15
dtype: int64
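If you prefer to stay with the groupby the question starts from, a sketch of the same idea is to aggregate per (metric1, metric2) and then broadcast the per-metric1 total back onto every row with transform (assuming df is the flat table shown in the question; the result column names are only illustrative):
out = df.groupby(['metric1', 'metric2'])['percentage'].sum().to_frame('% occurence')
out['total'] = out.groupby(level='metric1')['% occurence'].transform('sum')
print(out)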

How to show records where 3 fields match

I'm trying to write a query that will list the rows in a table where 3 specific fields have the same values, without knowing those values in advance:
TABLE:
FIELD 1 | FIELD 2 | FIELD 3 | FIELD 4
---------|--------------|------------|---------------
1 | 01-01-15 | 21 | 150
1 | 01-01-15 | 24 | 12
1 | 02-01-15 | 21 | 681
1 | 01-01-15 | 21 | 299
DESIRED RESULTS:
FIELD 1 | FIELD 2 | FIELD 3 | FIELD 4
-------------|--------------|-------------|------------
1 | 01-01-15 | 21 | 150
1 | 01-01-15 | 21 | 299
Sorry - still a newb here! Thanks in advance!
Count the number of rows with the same combination and filter for a count > 1:
select *
from tab
qualify count(*) over (partition by field1, field2, field3) > 1
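For readers coming from the pandas questions above, the same filter can be sketched with a groupby transform, assuming the table has been loaded into a DataFrame df with columns field1, field2, field3 and field4:
import pandas as pd

#rows whose (field1, field2, field3) combination appears more than once
mask = df.groupby(['field1', 'field2', 'field3'])['field1'].transform('size') > 1
print(df[mask])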