Pandas DataFrame not throwing an error for rows containing fewer fields - pandas

I am facing an issue when reading rows that contain fewer fields than expected. My dataset looks like this:
"Source1"~"schema1"~"table1"~"modifiedon"~"timestamp"~"STAGE"~15~NULL~NULL~FALSE~FALSE~TRUE
"Source1"~"schema2"~"table2"
and I am running the command below to read the dataset:
tabdf = pd.read_csv('test_table.csv',sep='~',header = None)
But it does not throw any error, though it is supposed to.
The version we are using
pip show pandas
Name: pandas
Version: 1.0.1
My question is how to make the process fail when the data has an incorrect structure.

You could inspect the data first using Pandas and then either fail the process if bad data exists or just read the known-good rows.
Read full rows into a dataframe
df = pd.read_fwf('test_table.csv', widths=[999999], header=None)
print(df)
0
0 "Source1"~"schema1"~"table1"~"modifiedon"~"tim...
1 "Source1"~"schema2"~"table2"
Count number of separators
sep_count = df[0].str.count('~')
sep_count
0 11
1 2
Maybe just terminate the process if there are bad (short) rows,
that is, if the number of unique separator counts is not 1.
sep_count.nunique()
2
Or just read the good rows
good_rows = sep_count.eq(11) # if you know what separator count should be. Or ...
good_rows = sep_count.eq(sep_count.max()) # if you know you have at least 1 good row
df = pd.read_csv('test_table.csv', sep='~', header=None).loc[good_rows]
print(df)
Result
0 1 2 3 4 5 6 7 8 9 10 11
0 Source1 schema1 table1 modifiedon timestamp STAGE 15.0 NaN NaN False False True
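A minimal sketch of failing fast instead of filtering, assuming fields never contain embedded newlines. The name read_strict is a hypothetical helper, not part of pandas, and the sample text simply mirrors the two rows from the question. It counts fields with the standard csv module, so a ~ inside a quoted field is not counted as a separator, and it raises before pandas parses anything.

```python
import csv
import io

import pandas as pd


def read_strict(source, sep="~"):
    """Read a headerless delimited file; raise if rows differ in field count.

    `source` is any open text file-like object (hypothetical helper).
    """
    text = source.read()
    # Collect the distinct field counts, skipping completely blank lines.
    counts = {len(row) for row in csv.reader(io.StringIO(text), delimiter=sep) if row}
    if len(counts) > 1:
        raise ValueError(f"inconsistent field counts: {sorted(counts)}")
    return pd.read_csv(io.StringIO(text), sep=sep, header=None)


sample = (
    '"Source1"~"schema1"~"table1"~"modifiedon"~"timestamp"~"STAGE"'
    "~15~NULL~NULL~FALSE~FALSE~TRUE\n"
    '"Source1"~"schema2"~"table2"\n'
)

try:
    read_strict(io.StringIO(sample))
    failed = False
except ValueError as exc:
    failed = True
    print(exc)  # inconsistent field counts: [3, 12]
```

With a real file, something like `with open('test_table.csv') as fh: tabdf = read_strict(fh)` should behave the same way.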

Related

Convert transactions with several products from columns to row [duplicate]

I'm having a very tough time trying to figure out how to do this with python. I have the following table:
NAMES VALUE
john_1 1
john_2 2
john_3 3
bro_1 4
bro_2 5
bro_3 6
guy_1 7
guy_2 8
guy_3 9
And I would like to go to:
NAMES VALUE1 VALUE2 VALUE3
john 1 2 3
bro 4 5 6
guy 7 8 9
I have tried with pandas, so I first split the index (NAMES) and I can create the new columns but I have trouble indexing the values to the right column.
Can someone at least give me a direction where the solution to this problem is? I don't expect a full code (I know that this is not appreciated) but any help is welcome.
After splitting the NAMES column, use .pivot to reshape your DataFrame.
# Split Names and Pivot.
df['NAME_NBR'] = df['NAMES'].str.split('_').str.get(1)
df['NAMES'] = df['NAMES'].str.split('_').str.get(0)
df = df.pivot(index='NAMES', columns='NAME_NBR', values='VALUE')
# Rename columns and reset the index.
df.columns = ['VALUE{}'.format(c) for c in df.columns]
df.reset_index(inplace=True)
If you want to be slick, you can do the split in a single line:
df['NAMES'], df['NAME_NBR'] = zip(*[s.split('_') for s in df['NAMES']])
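For reference, here is a self-contained sketch of the same split-and-pivot idea, built on the sample table from the question. It uses str.split with expand=True and add_prefix in place of the two separate splits and the manual rename; the variable names parts and wide are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "NAMES": ["john_1", "john_2", "john_3",
              "bro_1", "bro_2", "bro_3",
              "guy_1", "guy_2", "guy_3"],
    "VALUE": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# Split "john_1" into the name ("john") and the number ("1") in one pass.
parts = df["NAMES"].str.split("_", expand=True)

wide = (
    df.assign(NAMES=parts[0], NAME_NBR=parts[1])
      .pivot(index="NAMES", columns="NAME_NBR", values="VALUE")
      .add_prefix("VALUE")   # columns "1", "2", "3" become "VALUE1", ...
      .reset_index()
)
print(wide)
```

Note that pivot orders the rows by the index values, so the names will typically come out sorted alphabetically rather than in their original order.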

pandas dataframe - how to find multiple column names with minimum values

I have a dataframe (small sample shown below, it has more columns), and I want to find the column names with the minimum values.
Right now, I have the following code to deal with it:
finaldf['min_pillar_score'] = finaldf.iloc[:, 2:9].idxmin(axis="columns")
This works fine, but does not return multiple values of column names in case there is more than one instance of minimum values. How can I change this to return multiple column names in case there is more than one instance of the minimum value?
Please note, I want row wise results, i.e. minimum column names for each row.
Thanks!
Try the code below and see whether the output is in the format you anticipated; it produces the intended result at least.
The result will be stored in mins.
mins = df.idxmin(axis="columns")
for i, r in df.iterrows():
    mins[i] = list(r[r == r[mins[i]]].index)
The question "Get column name where value is something in pandas dataframe" might also be helpful.
EDIT: adding an image of the output and the full code context.
Assuming this input as df:
A B C D
0 5 8 9 5
1 0 0 1 7
2 6 9 2 4
3 5 2 4 2
4 4 7 7 9
You can use the underlying numpy array to get the overall minimum of the whole DataFrame, then compare the values to that minimum and get the columns that have a match:
s = df.eq(df.to_numpy().min()).any()
list(s[s].index)
output: ['A', 'B']
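Since the question asks for row-wise results, here is a hedged sketch of a per-row variant that avoids iterrows. It assumes the same sample df as above; the names mask and min_cols are illustrative only.

```python
import pandas as pd

# Same sample input as above.
df = pd.DataFrame({
    "A": [5, 0, 6, 5, 4],
    "B": [8, 0, 9, 2, 7],
    "C": [9, 1, 2, 4, 7],
    "D": [5, 7, 4, 2, 9],
})

# True wherever a cell equals the minimum of its own row.
mask = df.eq(df.min(axis=1), axis=0).to_numpy()

# For each row, collect the names of all columns that hold the row minimum.
min_cols = pd.Series([list(df.columns[m]) for m in mask], index=df.index)
print(min_cols)
```

For this input the rows give ['A', 'D'], ['A', 'B'], ['C'], ['B', 'D'] and ['A'] respectively.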

Removing values of a certain object type from a dataframe column in Pandas

I have a pandas dataframe where some values are integers and other values are an array. I simply want to drop all of the rows that contain the array (object datatype I believe) in my "ORIGIN_AIRPORT_ID" column, but I have not been able to figure out how to do so after trying many methods.
Here is what the first 20 rows of my dataframe looks like. The values that show up like a list are the ones I want to remove. The dataset is a couple million rows, so I just need to write code that removes all of the array-like values in that specific dataframe column if that makes sense.
df = df[df.origin_airport_ID.str.contains(',') == False]
Next time, please consider giving us a data sample as text instead of an image. It's easier for us to test your example.
Original data:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
3 20194149 [10397, 10398, 10399, 10400]
4 20194150 10397
In your case, you can use the pd.to_numeric function:
df['ORIGIN_AIRPORT_ID'] = pd.to_numeric(df['ORIGIN_AIRPORT_ID'], errors='coerce')
It replaces every cell that cannot be converted into a number with NaN (Not a Number), so we get:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397.0
1 20194147 10397.0
2 20194148 10397.0
3 20194149 NaN
4 20194150 10397.0
To remove these rows, just use .dropna:
df = df.dropna().astype('int')
which results in your desired DataFrame:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
4 20194150 10397
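Put together as one runnable sketch. It assumes the list-like cells are stored as strings (as they would be after reading a CSV), and it restricts dropna to the one column so that missing values elsewhere do not remove rows; the name cleaned is illustrative.

```python
import pandas as pd

# Sample mirroring the data above; the array-like cell is assumed to be a string.
df = pd.DataFrame({
    "ITIN_ID": [20194146, 20194147, 20194148, 20194149, 20194150],
    "ORIGIN_AIRPORT_ID": ["10397", "10397", "10397",
                          "[10397, 10398, 10399, 10400]", "10397"],
})

# Anything that is not a plain number becomes NaN.
df["ORIGIN_AIRPORT_ID"] = pd.to_numeric(df["ORIGIN_AIRPORT_ID"], errors="coerce")

# Drop only the rows where this column failed to convert, then restore integers.
cleaned = (
    df.dropna(subset=["ORIGIN_AIRPORT_ID"])
      .astype({"ORIGIN_AIRPORT_ID": "int64"})
)
print(cleaned)
```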

Copy a column value from another dataframe based on a condition

Let us say I have two dataframes: df1 and df2. Assume the following initial values.
df1=pd.DataFrame({'ID':['ASX-112','YTR-789','ASX-124','UYT-908','TYE=456','ERW-234','UUI-675','GHV-805','NMB-653','WSX-123'],
'Costperlb':[4515,5856,3313,9909,8980,9088,6765,3456,9012,1237]})
df2=df1[df1['Costperlb']>4560]
As you can see, df2 is a proper subset of df1 (it was created from df1 by imposing a condition on selection of rows).
I added a column to df2, which contains certain values based on a calculation. Let us call this df2['grade'].
df2['grade']=[1,4,3,5,1,1]
df1 and df2 contain one column named 'ID' which is guaranteed to be unique in each dataframe.
I want to:
Create a new column in df1 and initialize it to 0. Easy. df1['grade']=0.
Copy df2['grade'] values over to df1['grade'], ensuring that df1['ID']=df2['ID'] for each such copy.
The result should be the grade values for the corresponding IDs copied over.
Step 2 is what is perplexing me a bit. A naive df1['grade']=df2['grade'].values obviously does not work, as the lengths of the two dataframes are different.
Now, if I think hard enough, I could possibly come up with a monstrosity like:
df1['grade'].loc[(df1['ID'].isin(df2)) & ...] but I am uncomfortable with doing that.
I am a newbie with Python. Furthermore, the indices of df1 are used elsewhere after this assignment, so I do not want to drop or reset indices, as some of the solutions in the search results I found suggest.
I just want to find the rows in df1 where the 'ID' value matches the 'ID' value in df2, and then copy the 'grade' column value of that specific row over. How do I do this?
Your code:
df1=pd.DataFrame({'ID':['ASX-112','YTR-789','ASX-124','UYT-908','TYE=456','ERW-234','UUI-675','GHV-805','NMB-653','WSX-123'],
'Costperlb':[4515,5856,3313,9909,8980,9088,6765,3456,9012,1237]})
df2=df1[df1['Costperlb']>4560]
df2['grade']=[1,4,3,5,1,1]
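You can use merge with how="left". This keeps every row of df1 in its original order. Note that merge returns a new default integer index, which matches the index of df1 here only because df1 already uses the default index: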
new_df = df1.merge(df2[["ID","grade"]], on="ID", how="left")
new_df["grade"] = new_df["grade"].fillna(0)
new_df
Output:
ID Costperlb grade
0 ASX-112 4515 0.0
1 YTR-789 5856 1.0
2 ASX-124 3313 0.0
3 UYT-908 9909 4.0
4 TYE=456 8980 3.0
5 ERW-234 9088 5.0
6 UUI-675 6765 1.0
7 GHV-805 3456 0.0
8 NMB-653 9012 1.0
9 WSX-123 1237 0.0
Here I called the merged dataframe new_df, but you can simply change it to df1.
EDIT
If instead of 0 you want to replace the NaN with a string, try this:
new_df = df1.merge(df2[["ID","grade"]], on="ID", how="left")
new_df["grade"] = new_df["grade"].fillna("No transaction possible")
new_df
Output:
ID Costperlb grade
0 ASX-112 4515 No transaction possible
1 YTR-789 5856 1
2 ASX-124 3313 No transaction possible
3 UYT-908 9909 4
4 TYE=456 8980 3
5 ERW-234 9088 5
6 UUI-675 6765 1
7 GHV-805 3456 No transaction possible
8 NMB-653 9012 1
9 WSX-123 1237 No transaction possible
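If the existing index of df1 has to stay untouched, one alternative (swapped in here, not part of the answer above) is to look the grades up by ID with Series.map, which assigns in place and never rebuilds the index. This is a sketch that relies on the IDs being unique, as the question guarantees; grade_by_id is an illustrative name, and the .copy() is added only to avoid a SettingWithCopyWarning.

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": ["ASX-112", "YTR-789", "ASX-124", "UYT-908", "TYE=456",
           "ERW-234", "UUI-675", "GHV-805", "NMB-653", "WSX-123"],
    "Costperlb": [4515, 5856, 3313, 9909, 8980, 9088, 6765, 3456, 9012, 1237],
})
df2 = df1[df1["Costperlb"] > 4560].copy()
df2["grade"] = [1, 4, 3, 5, 1, 1]

# Lookup table: ID -> grade (requires unique IDs in df2).
grade_by_id = df2.set_index("ID")["grade"]

# IDs missing from df2 map to NaN, which is then replaced by 0.
df1["grade"] = df1["ID"].map(grade_by_id).fillna(0).astype(int)
print(df1)
```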

combine csv files, sort them by time and average the columns

I have many datasets in csv files; they look like the ones in the picture that I attached.
The first column always holds the time in minutes, but the time steps and the total number of rows differ between the raw data files. I'd like to have one output file (csv file) in which all the raw files are combined and sorted by time, so that the time increases from the top to the bottom of the column.
[Image: raw data and output]
The concentration column should be averaged, when more than one number exists.
I tried like this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
d1 = pd.read_csv('dat1.txt', sep="\t")
d2 = pd.read_csv('dat2.txt', sep="\t")
d1.columns
d2.columns
merged_outer = pd.merge(d1,d2, on='time', how='outer')
print(merged_outer)
but it doesn't lead to the correct output. I'm a beginner in Pandas but I hope I explained the problem well enough. Thank you for any idea or suggestion!
Thank you for your idea. Unfortunately, when I run it I get an error message saying that dat1.txt doesn't exist. This seems strange to me as I read the raw files initially by:
d1 = pd.read_csv('dat1.txt', sep="\t")
d2 = pd.read_csv('dat2.txt', sep="\t")
Sorry, here is the data as raw text:
raw data 1
time column2 column3 concentration
1 2 4 3
2 2 4 6
4 2 4 2
7 2 4 5
raw data 2
time column2 column3 concentration
1 2 4 6
2 2 4 2
8 2 4 9
10 2 4 5
12 2 4 7
Something like this might work
filenames = ['dat1.txt', 'dat2.txt',...]
dataframes = {filename: pd.read_csv(filename, sep="\t") for filename in filenames}
merged_outer = pd.concat(dataframes).groupby('time').mean()
When you pass a dict to pd.concat, it creates a MultiIndex DataFrame with the dict keys as level 0.
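A self-contained sketch of the same concat-then-groupby idea, using the two raw data samples from the question as in-memory text instead of files (the tab separators and the names raw1, raw2, frames and combined are assumptions for illustration). groupby sorts by the time key by default, so the result is already ordered by time.

```python
import io

import pandas as pd

# The two raw samples from the question, assumed to be tab-separated.
raw1 = (
    "time\tcolumn2\tcolumn3\tconcentration\n"
    "1\t2\t4\t3\n2\t2\t4\t6\n4\t2\t4\t2\n7\t2\t4\t5\n"
)
raw2 = (
    "time\tcolumn2\tcolumn3\tconcentration\n"
    "1\t2\t4\t6\n2\t2\t4\t2\n8\t2\t4\t9\n10\t2\t4\t5\n12\t2\t4\t7\n"
)

frames = [pd.read_csv(io.StringIO(text), sep="\t") for text in (raw1, raw2)]

# Stack all rows, then average every column over rows sharing the same time.
combined = (
    pd.concat(frames, ignore_index=True)
      .groupby("time", as_index=False)
      .mean()
)
print(combined)

# CSV text of the result; pass a path to to_csv to write a file instead.
csv_text = combined.to_csv(index=False)
```

Times 1 and 2 appear in both samples, so their concentrations are averaged to 4.5 and 4.0; all other times keep their single value.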