combine csv files, sort them by time and average the columns - pandas

I have many datasets in CSV files; they look like the picture I attached. The first column is always the time in minutes, but the time steps and the total number of rows differ between the raw data files. I'd like one output file (a CSV file) in which all the raw files are combined and sorted by time, so that the time increases from the top to the bottom of the column.
[screenshot: raw data and output]
The concentration column should be averaged whenever more than one value exists for the same time.
I tried like this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
d1 = pd.read_csv('dat1.txt', sep="\t")
d2 = pd.read_csv('dat2.txt', sep="\t")
d1.columns
d2.columns
merged_outer = pd.merge(d1,d2, on='time', how='outer')
print(merged_outer)
but it doesn't lead to the correct output. I'm a beginner with pandas, but I hope I explained the problem well enough. Thank you for any ideas or suggestions!
Thank you for your idea. Unfortunately, when I run it I get an error message saying that dat1.txt doesn't exist. This seems strange to me, as I initially read the raw files with:
d1 = pd.read_csv('dat1.txt', sep="\t")
d2 = pd.read_csv('dat2.txt', sep="\t")
Sorry, here is the data as raw text:
raw data 1
time column2 column3 concentration
1 2 4 3
2 2 4 6
4 2 4 2
7 2 4 5
raw data 2
time column2 column3 concentration
1 2 4 6
2 2 4 2
8 2 4 9
10 2 4 5
12 2 4 7

Something like this might work:
filenames = ['dat1.txt', 'dat2.txt',...]
dataframes = {filename: pd.read_csv(filename, sep="\t") for filename in filenames}
merged_outer = pd.concat(dataframes).groupby('time').mean()
When you pass a dict to pd.concat, it creates a DataFrame with a MultiIndex whose level 0 contains the dict keys.
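For what it's worth, the outer merge from the question lines the two files up side by side (producing concentration_x and concentration_y columns) instead of stacking them, which is why pd.concat fits better here. A minimal end-to-end sketch with the two sample files (the output file name is an assumption):
import pandas as pd
filenames = ['dat1.txt', 'dat2.txt']
dataframes = {f: pd.read_csv(f, sep="\t") for f in filenames}
# Stack the files vertically, then average every column per time value.
combined = pd.concat(dataframes).groupby('time').mean()
# groupby sorts its keys, so 'time' already increases from top to bottom.
combined.reset_index().to_csv('combined.csv', index=False)
With the sample data above, the combined, time-sorted result is:
time column2 column3 concentration
1 2.0 4.0 4.5
2 2.0 4.0 4.0
4 2.0 4.0 2.0
7 2.0 4.0 5.0
8 2.0 4.0 9.0
10 2.0 4.0 5.0
12 2.0 4.0 7.0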


Pandas dataframe not throwing an error for rows containing fewer fields

I am facing an issue when reading rows that contain fewer fields. My dataset looks like the one below:
"Source1"~"schema1"~"table1"~"modifiedon"~"timestamp"~"STAGE"~15~NULL~NULL~FALSE~FALSE~TRUE
"Source1"~"schema2"~"table2"
and I am running the command below to read the dataset:
tabdf = pd.read_csv('test_table.csv',sep='~',header = None)
But it's not throwing any error, even though it's supposed to.
The version we are using
pip show pandas
Name: pandas
Version: 1.0.1
My question is: how do I make the process fail when the data structure is incorrect?
You could inspect the data first using Pandas and then either fail the process if bad data exists or just read the known-good rows.
Read full rows into a dataframe
df = pd.read_fwf('test_table.csv', widths=[999999], header=None)
print(df)
0
0 "Source1"~"schema1"~"table1"~"modifiedon"~"tim...
1 "Source1"~"schema2"~"table2"
Count number of separators
sep_count = df[0].str.count('~')
sep_count
0 11
1 2
Maybe just terminate the process if there are bad (short) rows, i.e. if the number of unique separator counts is not 1:
sep_count.nunique()
2
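A minimal sketch of that hard-fail variant (treating the longest row as the correct shape, which assumes there is at least one good row):
import pandas as pd
df = pd.read_fwf('test_table.csv', widths=[999999], header=None)
sep_count = df[0].str.count('~')
# Fail hard when the rows disagree on the number of fields.
if sep_count.nunique() != 1:
    bad_rows = sep_count[sep_count != sep_count.max()]
    raise ValueError(f"Rows with missing fields: {list(bad_rows.index)}")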
Or just read the good rows
good_rows = sep_count.eq(11) # if you know what separator count should be. Or ...
good_rows = sep_count.eq(sep_count.max()) # if you know you have at least 1 good row
df = pd.read_csv('test_table.csv', sep='~', header=None).loc[good_rows]
print(df)
Result
0 1 2 3 4 5 6 7 8 9 10 11
0 Source1 schema1 table1 modifiedon timestamp STAGE 15.0 NaN NaN False False True

Pandas Merge function only giving column headers - Update

What I want to achieve
I have two data frames, DF1 and DF2, each read from a different Excel file.
DF1 has 9 columns and 3000 rows; one of the columns is named "Code Group".
DF2 has 2 columns and 20 rows, and it also has a "Code Group" column. Its other column, "Code Management Method", gives the explanation of the code group. For example, H001 is treated as recyclable and H002 is treated as landfill.
What happens
When I use the command data = pd.merge(DF1, DF2, on='Code Group'), I get only the 10 column names but no rows underneath.
What I expect
I want DF1 and DF2 to be merged so that wherever the Code Group numbers match, the Code Management Method is added as the explanation.
Additional information
The following are the datatypes for DF1:
Entity object
Address object
State object
Site object
Disposal Facility object
Pounds float64
Waste Description object
Shipment Date datetime64[ns]
Code Group object
The following are the datatypes for DF2:
Code Management Method object
Code Group object
What I tried
I tried to follow the suggestions from similar posts on SO that the datatypes on both sides should be the same; Code Group is an object in both dataframes, so I don't know what the issue is. I also tried the concat function.
Code
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
CH = r"C:\Python\Waste\Shipment.xls"   # raw strings keep the backslashes literal
Code = r"C:\Python\Waste\Code.xlsx"
Data = pd.read_excel(Code)
data1 = pd.read_excel(CH)
data1.rename(columns={
    'generator_name': 'Entity', 'generator_address': 'Address',
    'generator_state': 'State', 'Site_City': 'Site',
    'final_disposal_facility_name': 'Disposal Facility',
    'drum_wgt': 'Pounds', 'wst_dscrpn': 'Waste Description',
    'genrtr_sgntr_dt': 'Shipment Date',
    'expected_disposal_management_methodcode': 'Code Group'},
    inplace=True)
data2 = data1[['Entity','Address','State','Site','Disposal Facility','Pounds','Waste Description','Shipment Date','Code Group']]
data2
merged = data2.merge(Data, on='Code Group')
Getting a Warning
C:\Anaconda\lib\site-packages\pandas\core\generic.py:5890: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._update_inplace(new_data)
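As an aside, that SettingWithCopyWarning usually means a dataframe that is a slice of another one is being modified; here data2 is a slice of data1. A small sketch of the usual fix, taking an explicit copy when slicing:
data2 = data1[['Entity', 'Address', 'State', 'Site', 'Disposal Facility',
               'Pounds', 'Waste Description', 'Shipment Date', 'Code Group']].copy()
# Later assignments then modify an independent frame, not a view of data1.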
import pandas as pd
df1 = pd.DataFrame({'Region': [1, 2, 3],
                    'zipcode': [12345, 23456, 34567]})
df2 = pd.DataFrame({'ZipCodeLowerBound': [10000, 20000, 30000],
                    'ZipCodeUpperBound': [19999, 29999, 39999],
                    'Region': [1, 2, 3]})
df1.merge(df2, on='Region')
This is how the example is given; df1 and df2 look like this:
Region zipcode
0 1 12345
1 2 23456
2 3 34567
Region ZipCodeLowerBound ZipCodeUpperBound
0 1 10000 19999
1 2 20000 29999
2 3 30000 39999
and the merge will result in:
Region zipcode ZipCodeLowerBound ZipCodeUpperBound
0 1 12345 10000 19999
1 2 23456 20000 29999
2 3 34567 30000 39999
I hope this is what you want to do
After multiple tries I found that the column had some garbage, so I used the code below and it worked perfectly. The funny thing is that I never encountered the problem on two other datasets that I imported from Excel files.
data2['Code Group'] = data2['Code Group'].str.strip()
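A quick way to spot this kind of garbage before merging is to compare the key sets of the two frames; a small sketch using the variable names from the code above:
# Keys present in data2 but missing from Data expose mismatches like 'H001 ' vs 'H001'.
print(set(data2['Code Group']) - set(Data['Code Group']))
An empty set means every key has a match, so the merge returns rows instead of just headers.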

How to create new pandas column by vlookup-like procedure on another data-frame

I have a dataframe that looks like the one below (reproduced as text in the answer). It will be used to map values using two categorical variables; maybe converting it to a dictionary would be better.
The second dataframe is very large; a screenshot was attached. I want to take the values of the categorical variables and create a new attribute (column) based on the first dataframe.
For example...
A row with FICO_cat of (700,720] and OrigLTV_cat of (75,80] would receive a value of 5.
A row with FICO_cat of (700,720] and OrigLTV_cat of (85,90] would receive a value of 6.
Is there an efficient way to do this?
If your column labels are the FICO_cat values and your index is OrigLTV_cat, this should work:
Given a dataframe df:
780+ (740,780) (720,740)
(60,70) 3 3 3
(70,75) 4 5 4
(75,80) 3 1 2
Do:
df = df.unstack().reset_index()
df.rename(columns = {'level_0' : 'FICOCat', 'level_1' : 'OrigLTV', 0 : 'value'}, inplace = True)
Output:
FICOCat OrigLTV value
0 780+ (60,70) 3
1 780+ (70,75) 4
2 780+ (75,80) 3
3 (740,780) (60,70) 3
4 (740,780) (70,75) 5
5 (740,780) (75,80) 1
6 (720,740) (60,70) 3
7 (720,740) (70,75) 4
8 (720,740) (75,80) 2
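From there, the vlookup itself is a plain merge of the long-format table onto the large dataframe; a minimal sketch, assuming the big frame is named data and has FICO_cat and OrigLTV_cat columns as in the question:
lookup = df.rename(columns={'FICOCat': 'FICO_cat', 'OrigLTV': 'OrigLTV_cat'})
# A left join keeps every row of data and appends the looked-up value.
data = data.merge(lookup, on=['FICO_cat', 'OrigLTV_cat'], how='left')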

how to convert a column of pandas series without the header

It is quite odd, as I hadn't experienced this issue with converting a data series until now.
So I have wind speed data by date and hour at different heights, retrieved from NREL.
file09 = 'wind/wind_yr2009.txt'
wind09 = pd.read_csv(file09, encoding = "utf-8", names = ['DATE (MM/DD/YYYY)', 'HOUR-MST', 'AWS#20m [m/s]', 'AWS#50m [m/s]', 'AWS#80m [m/s]', 'AMPLC(2-80m)'])
file10 = 'wind/wind_yr2010.txt'
wind10 = pd.read_csv(file10, encoding = "utf-8", names = ['DATE (MM/DD/YYYY)', 'HOUR-MST', 'AWS#20m [m/s]', 'AWS#50m [m/s]', 'AWS#80m [m/s]', 'AMPLC(2-80m)'])
I merge the readings of the two .txt files below:
wind = pd.concat([wind09, wind10], join='inner')
Then I drop the duplicate headings:
wind = wind.reset_index().drop_duplicates(keep='first').set_index('index')
print(wind['HOUR-MST'])
Printing returns something like the following:
index
0 HOUR-MST
1 1
2 2
I wasn't sure at first, but apparently index 0 contains 'HOUR-MST', which is the column heading itself. Python does recognize the column, as I can access its data using that header. Yet when I try converting it to int:
temp = hcodebook.iloc[wind['HOUR-MST'].astype(int) - 1]
Both of the errors below were returned (I later also tried converting to float):
ValueError: invalid literal for int() with base 10: 'HOUR-MST'
ValueError: could not convert string to float: 'HOUR-MST'
Using try/except in a for loop, I verified that only the 0th index contains a string.
I think the reason is that I didn't use the sep parameter when reading these files, as that is the only difference from my previous attempts with other files, where the conversion gave no trouble.
Yet that doesn't tell me how to address it.
Kindly advise.
MCVE:
from io import StringIO
import pandas as pd
cfile = StringIO("""A B C D
1 2 3 4
5 6 7 8""")
pd.read_csv(cfile, names=['a','b','c','d'], sep=r'\s+')
Header included in data:
a b c d
0 A B C D
1 1 2 3 4
2 5 6 7 8
Use skiprows to avoid getting headers:
from io import StringIO
import pandas as pd
cfile = StringIO("""A B C D
1 2 3 4
5 6 7 8""")
pd.read_csv(cfile, names=['a','b','c','d'], sep=r'\s+', skiprows=1)
No headers:
a b c d
0 1 2 3 4
1 5 6 7 8
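An alternative with the same effect, sketched against the MCVE above: pass header=0 together with names, which makes read_csv consume the first row as the header and substitute your names for it:
from io import StringIO
import pandas as pd
cfile = StringIO("""A B C D
1 2 3 4
5 6 7 8""")
# header=0 discards the file's own header row in favour of `names`.
pd.read_csv(cfile, names=['a','b','c','d'], sep=r'\s+', header=0)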

Python Pandas extract columns from .csv

I have a .csv file, which I can read in Pandas. The .csv file looks like the following.
a b c d e
1 4 3 2 5
6 7 8 3 6
...
What I need to achieve is to extract a and b as column vectors and
[c d e] as a matrix. I used Pandas with the following code to read the .csv file:
pd.read_csv('data.csv', sep=',',header=None)
But this gives me an array like [[a,b,c,d,e],[1,4,3,2,5],...].
How can I extract the columns? I heard about df.iloc, but it cannot be used here, since after pd.read_csv there is only one column.
You should be able to do that with:
ds = pd.read_csv('data.csv', sep=',', header=0)
column_a = ds["a"]
column_b = ds["b"]
matrix = ds[["c", "d", "e"]]
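If you want actual NumPy objects for the vector and the matrix rather than pandas ones, to_numpy() converts either selection; a short sketch continuing from the ds above:
column_a = ds["a"].to_numpy()            # 1-D array, shape (n,)
matrix = ds[["c", "d", "e"]].to_numpy()  # 2-D array, shape (n, 3)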