How to read a very messy .txt file using pd.read_csv() with multiple conditions - pandas

I have a very messy .txt file that I'm attempting to read in using pd.read_csv(). The file has multiple challenges to overcome:
1) The first 12 lines are not needed and therefore need to be skipped; the next 50 rows are needed, the next 14 rows need to be skipped, the next 50 are needed, the next 14 skipped, and so on.
2) Each logical row of data actually spans 2 rows of the report, meaning that the 2nd row needs to be lifted up to the 1st row and placed to its right in new columns. (This would halve the total number of rows and double the number of columns of the desired dataframe.)
3) The first row of each pair has 8 spaces of separation between values, while the second row has anywhere from 8 to 17 spaces of separation between values.
I thought the best way to approach this would be to first remove the rows that I don't need. I would then find a way to merge row 1 with row 2, row 3 with row 4, row 5 with row 6, and so on until all rows are correctly consolidated. I would then use the 'sep' parameter to split each row on any run of 8 or more spaces. This would hopefully get to my desired output - has anyone ever overcome a similar challenge?
First picture is an image of the raw data
Second picture is my ideal output

Ok, so the error_bad_lines=False combined with sep = '\s+|\^+' worked a treat.
I then solved the problem of bad lines by removing them one by one.
I then solved the '1 row over 2 rows' problem by splitting the dataframe into two dfs (df8, df9) and recombining them on axis=1. Looks perfect now.
import pandas as pd #importing Pandas Package to wrangle data
boltcogs = 'ABAPlist.txt'
df = pd.read_csv(boltcogs,skiprows=12,error_bad_lines=False,header = None ,sep = '\s+|\^+')
df1 = df[df.iloc[:,0] != 'Production' ] ## removing verbose lines
df2 = df1[df1.iloc[:,0] != '----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------' ]
df3 = df2[df2.iloc[:,0] != 'Kuala' ] ## removing bad rows
df4 = df3[df3.iloc[:,0] != 'Operating' ] ## removing bad rows
df5 = df4[df4.iloc[:,0] != 'Plant:' ] ## removing bad rows
df6 = df5[df5.iloc[:,0] != 'Costing' ] ## removing bad rows
df7 = df6[df6.iloc[:,0] != 'Currency:' ] ## removing bad rows
df8 = df7.iloc[0::2, :].reset_index() # every even-numbered row: the first half of each logical row
df9 = df7.iloc[1::2, :].reset_index() # every odd-numbered row: the second half of each logical row
df10 = pd.concat([df8, df9], axis=1, ignore_index=True) # joining them together
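As an aside, the repeating pattern from the question (skip 12 lines, keep 50, skip 14, keep 50, ...) can also be expressed up front with a skiprows callable, and the one-filter-per-token chain above can be collapsed with isin. A rough sketch under those assumptions, reusing the same placeholder filename and separator (on_bad_lines='skip' is the newer spelling of error_bad_lines=False in recent pandas):

import pandas as pd

def skip_row(i):
    # The first 12 lines are always skipped.
    if i < 12:
        return True
    # After that the report repeats in blocks of 64 lines: 50 kept, then 14 skipped.
    return (i - 12) % 64 >= 50

df = pd.read_csv('ABAPlist.txt', header=None, sep=r'\s+|\^+',
                 engine='python', skiprows=skip_row, on_bad_lines='skip')

# Drop the verbose/separator rows in one pass instead of one filter per token.
bad_tokens = ['Production', 'Kuala', 'Operating', 'Plant:', 'Costing', 'Currency:']
df = df[~df.iloc[:, 0].isin(bad_tokens)]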

Related

How to make dataframe from different parts of an Excel sheet given specific keywords?

I have one Excel file where multiple tables are placed in the same sheet. My requirement is to read certain tables based on a keyword. I have been reading the tables with the skiprows and nrows parameters, which works for now, but it won't work in future because the table lengths are dynamic.
Is there any workaround, other than the skiprows & nrows approach, to read the tables shown in the picture?
I want to read data1 as one table and data2 as another table. In particular, I want columns "RR", "FF" & "WW" from each as two different data frames.
I'd appreciate it if someone can help or guide me on how to do this.
Method I have tried:
all_files=glob.glob(INPATH+"*sample*")
df1 = pd.read_excel(all_files[0],skiprows=11,nrows= 3)
df2 = pd.read_excel(all_files[0],skiprows=23,nrows= 3)
This works fine; the only problem is that the table length will vary every time.
With an Excel file identical to the one in your image, here is one way to do it:
import pandas as pd
df = pd.read_excel("file.xlsx").dropna(how="all").reset_index(drop=True)
# Setup
targets = ["Data1", "Data2"]
indices = [df.loc[df["Unnamed: 0"] == target].index.values[0] for target in targets]
dfs = []
for i in range(len(indices)):
    # Slice df from the current keyword index up to (but not including) the next one
    try:
        data = df.loc[indices[i] : indices[i + 1] - 1, :]
    except IndexError:
        data = df.loc[indices[i] :, :]
    # Within this slice, keep only the rows from the 'rr' header row onward
    r_idx = data.loc[df["Unnamed: 0"] == "rr"].index.values[0]
    data = data.loc[r_idx:, :].reset_index(drop=True).dropna(how="all", axis=1)
    # Cleanup
    data.columns = data.iloc[0]
    data.columns.name = ""
    dfs.append(data.loc[1:, :].iloc[:, 0:3])
And so:
for item in dfs:
    print(item)
# Output
rr ff ww
1 car1 1000000 sellout
2 car2 1500000 to be sold
3 car3 1300000 sellout
rr ff ww
1 car1 1000000 sellout
2 car2 1500000 to be sold
3 car3 1300000 sellout

Pandas: dividing a filtered column from df1 by a filtered column of df2 - warning and weird behavior

I have a data frame which is conditionally broken up into two separate dataframes as follows:
df = pd.read_csv(file, names=names)
df = df.loc[df['name1'] == common_val]
df1 = df.loc[df['name2'] == target1]
df2 = df.loc[df['name2'] == target2]
# each df has a 'name3' I want to perform a division on after this filtering
The original df is filtered by a value shared by the two dataframes, and then each of the two new dataframes is further filtered by another shared column.
What I want to work:
df1['name3'] = df1['name3']/df2['name3']
However, as many questions have pointed out, this causes a SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I tried what was recommended in this question:
df1.loc[:,'name3'] = df1.loc[:,'name3'] / df2.loc[:,'name3']
# also tried:
df1.loc[:,'name3'] = df1.loc[:,'name3'] / df2['name3']
But in both cases I still get weird behavior and the set by copy warning.
I then tried what was recommended in this answer:
df.loc[df['name2']==target1, 'name3'] = df.loc[df['name2']==target1, 'name3']/df.loc[df['name2'] == target2, 'name3']
which still results in the same copy warning.
If possible I would like to avoid copying the data frame to get around this because of the size of these dataframes (and I'm already somewhat wastefully making two almost identical dfs from the original).
If copying is the best way to go with this problem I'm interested to hear why that works over all the options I explored above.
Edit: here is a simple data frame along the lines of what df would look like after the line df = df.loc[df['name1'] == common_val]:
name1 other1 other2 name2 name3
a x y 1 2
a x y 1 4
a x y 2 5
a x y 2 3
So if target1=1 and target2=2,
I would like df1 to contain only the rows where name2=1 and df2 to contain only the rows where name2=2, then divide the resulting df1['name3'] by the resulting df2['name3'].
If there is a less convoluted way to do this (without splitting the original df) I'm open to that as well!
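A minimal sketch of one way around both symptoms, assuming the two filtered frames have the same number of rows and correspond to each other positionally (df, common_val, target1 and target2 are the question's own names): the warning comes from assigning into a slice of df, and the odd results come from pandas aligning the division on the index, which differs between df1 and df2.

import pandas as pd

# df, common_val, target1, target2 as defined in the question
df = df.loc[df['name1'] == common_val]
df1 = df.loc[df['name2'] == target1].copy()  # .copy() makes df1 an independent frame, silencing the warning
df2 = df.loc[df['name2'] == target2]

# Series division aligns on the index, and df1/df2 have different indices,
# so divide by the raw values instead (rows must match up by position).
df1['name3'] = df1['name3'].to_numpy() / df2['name3'].to_numpy()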

pandas vertical concat not working as expected

I have 2 dataframes which I am trying to merge/stack vertically. The first dataframe has 25 columns and the second has 13, of which I only want to select 1.
When I execute the code, I get more records than expected.
I don't understand where the problem lies.
To understand this, I tried loading the data again in a fresh pandas dataframe.
input_df = pd.read_csv()
print(input_df.shape)
(8809, 11)
filtered_df = input_df[input_df['label'] != -1] # try to filter based on label column
print(filtered_df.shape)
(6603, 11)
But when I simply print the filtered_df, I can still see 8809 records.
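For reference, a minimal sketch of the stacking described above, with a hypothetical column name standing in for the one column wanted from the second dataframe; with axis=0 and ignore_index=True the result has exactly len(df1) + len(df2) rows, so any surplus records would have to originate before the concat (for example in the filtering step).

import pandas as pd

# df1 has 25 columns, df2 has 13; 'wanted_col' is a hypothetical name for the
# single column to take from df2.
stacked = pd.concat([df1, df2[['wanted_col']]], axis=0, ignore_index=True)

# Row counts simply add when stacking vertically.
assert len(stacked) == len(df1) + len(df2)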

Sum pandas columns, excluding some rows based on other column values

I'm attempting to determine the number of widget failures from a test population.
Each widget can fail in 0, 1, or multiple ways. I'd like to calculate the number of failures for each failure mode, but once a widget is known to have failed, it should be excluded from later sums. In other words, the failure modes are known and ordered. If a widget fails via mode 1 and mode 3, I don't care about mode 3: I just want to count mode 1.
I have a dataframe with one row per item, and one column per failure mode. If the widget fails in that mode, the column value is 1, else it is 0.
d = {"item_1":
{"failure_1":0, "failure_2":0},
"item_2":
{"failure_1":1, "failure_2":0},
"item_3":
{"failure_1":0, "failure_2":1},
"item_4":
{"failure_1":1, "failure_2":1}}
df = pd.DataFrame(d).T
display(df)
Output:
failure_1 failure_2
item_1 0 0
item_2 1 0
item_3 0 1
item_4 1 1
If I just want to sum the columns, that's easy: df.sum(). And if I want to calculate percentage failures, easy too: df.sum()/len(df). But this counts widgets that fail in multiple ways, multiple times. For the problem stated, the best I can come up with is this:
# create empty df to store results
df2 = pd.DataFrame(columns=["total_failures"])
for col in df.columns:
    # create a row, named after the column, and assign it the column sum
    df2.loc[col] = df[col].sum()
    # drop the rows of df where this column is equal to 1
    df = df.loc[df[col] != 1]
display(df2)
Output:
total_failures
failure_1 2
failure_2 1
This requires creating another dataframe (that's fine), but it also requires iterating over the existing dataframe's columns and deleting a few of its rows at a time. If the dataframe takes a while to generate, or is needed for future calculations, this is not workable. I can deal with iterating over the columns.
Is there a way to do this without deleting the original df, or making a temporary copy? (Not workable with large data sets.)
You can do a cumsum on axis=1 and, wherever the value is greater than 1, mask it as 0, then take the sum:
out = df.mask(df.cumsum(axis=1).gt(1), 0).sum().to_frame('total_failures')
print(out)
total_failures
failure_1 2
failure_2 1
This way the original df is retained too.
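To see why this works on the example frame above, it helps to print the intermediate steps: any failure after a widget's first one pushes the running total past 1 and is zeroed out before the final sum.

print(df.cumsum(axis=1))
#         failure_1  failure_2
# item_1          0          0
# item_2          1          1
# item_3          0          1
# item_4          1          2

print(df.mask(df.cumsum(axis=1).gt(1), 0))
#         failure_1  failure_2
# item_1          0          0
# item_2          1          0
# item_3          0          1
# item_4          1          0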

Read from the specific lines of a csv in pandas [duplicate]

I'm having trouble figuring out how to skip n rows in a csv file but keep the header, which is the first row.
What I want to do is iterate but keep the header from the first row. skiprows makes the header the first row after the skipped rows. What is the best way of doing this?
data = pd.read_csv('test.csv', sep='|', header=0, skiprows=10, nrows=10)
You can pass a list of row numbers to skiprows instead of an integer.
By giving the function the integer 10, you're just skipping the first 10 lines.
To keep the first row 0 (as the header) and then skip everything else up to row 10, you can write:
pd.read_csv('test.csv', sep='|', skiprows=range(1, 10))
Other ways to skip rows using read_csv
The two main ways to control which rows read_csv uses are the header or skiprows parameters.
Suppose we have the following CSV file with one column:
a
b
c
d
e
f
In each of the examples below, this file is f = io.StringIO("\n".join("abcdef")).
Read all lines as values (no header, defaults to integers)
>>> pd.read_csv(f, header=None)
0
0 a
1 b
2 c
3 d
4 e
5 f
Use a particular row as the header (skip all lines before that):
>>> pd.read_csv(f, header=3)
d
0 e
1 f
Use multiple rows as the header, creating a MultiIndex (skip all lines before the last specified header line):
>>> pd.read_csv(f, header=[2, 4])
c
e
0 f
Skip N rows from the start of the file (the first row that's not skipped is the header):
>>> pd.read_csv(f, skiprows=3)
d
0 e
1 f
Skip one or more rows by giving the row indices (the first row that's not skipped is the header):
>>> pd.read_csv(f, skiprows=[2, 4])
a
0 b
1 d
2 f
Great answers already. Consider this generalized scenario:
Say your xls/csv has junk rows in the top 2 rows (rows #0 and #1). Row #2 (the 3rd row) is the real header and you want to load 10 rows starting from row #50 (i.e. the 51st row).
Here's the snippet:
pd.read_csv('test.csv', header=2, skiprows=range(3, 50), nrows=10)
To expand on @AlexRiley's answer, the skiprows argument takes a list of numbers which determines what rows to skip. So:
pd.read_csv('test.csv', sep='|', skiprows=range(1, 10))
is the same as:
pd.read_csv('test.csv', sep='|', skiprows=[1,2,3,4,5,6,7,8,9])
The best way to go about ignoring specific rows would be to create your ignore list (either manually or with a function like range that returns a list of integers) and pass it to skiprows.
If you're iterating through a long csv file, you can use the chunksize argument. If for some reason you need to manually step through it, you can try the following as long as you know how many iterations you need to go through:
for i in range(num_iters):
    # chunk i: keep the header row, skip the first i*10 data rows, read the next 10
    chunk = pd.read_csv('test.csv', sep='|', header=0,
                        skiprows=range(1, i*10 + 1), nrows=10)
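Along the lines of the chunksize suggestion above, a small sketch (process is a hypothetical per-chunk handler); pandas yields consecutive 10-row chunks as DataFrames, each keeping the column names from the header row:

import pandas as pd

for chunk in pd.read_csv('test.csv', sep='|', header=0, chunksize=10):
    process(chunk)  # hypothetical function applied to each 10-row DataFrame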
If you need to skip/drop specific rows, say the first 3 rows (i.e. 0, 1, 2) and then 2 more rows (i.e. 4, 5), you can use the following to retain the header row:
df = pd.read_csv(file_in, delimiter='\t', skiprows=[0,1,2,4,5], encoding='utf-16', usecols=cols)