I have a df like below:
import pandas as pd
import numpy as np

df = pd.DataFrame({"location": ["north", "south", "north"], "store": ["a", "b", "c"], "date": ["02112018", "02152018", "02182018"], "barcode1": ["ok", "low", "ok"], "barcode2": ["low", "zero", "zero"], "barcode3": ["ok", "zero", "low"]})
What I would like to have is a long-format df_desired with one row per barcode, i.e. the columns location, store, date, barcode and control.
What I have done is to repeat each row as many times as there are barcode columns, with the code below:
df_1 = pd.DataFrame(np.repeat(df.iloc[:, :3].values, len(df.iloc[0, :3]), axis=0))
df_1.columns = df.columns[:3]
and I get the output below, where each of the three rows is repeated three times:
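  location store      date
0    north     a  02112018
1    north     a  02112018
2    north     a  02112018
3    south     b  02152018
4    south     b  02152018
5    south     b  02152018
6    north     c  02182018
7    north     c  02182018
8    north     c  02182018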
However, I do not know how to get from there to df_desired.
Sorry that I could not find a more suitable title; any help would be appreciated.
You could use pd.melt to unpivot the dataframe; .sort_values by store then gives you the desired row order.
pd.melt(
    df,
    id_vars=['location', 'store', 'date'],
    var_name='barcode',
    value_name='control'
).sort_values(by=['store'])
location store date barcode control
0 north a 02112018 barcode1 ok
3 north a 02112018 barcode2 low
6 north a 02112018 barcode3 ok
1 south b 02152018 barcode1 low
4 south b 02152018 barcode2 zero
7 south b 02152018 barcode3 zero
2 north c 02182018 barcode1 ok
5 north c 02182018 barcode2 zero
8 north c 02182018 barcode3 low
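As a small follow-up sketch (assuming the original index is not needed): sorting on both store and barcode makes the row order fully deterministic, and reset_index(drop=True) gives a clean 0..n index.

out = pd.melt(
    df,
    id_vars=['location', 'store', 'date'],
    var_name='barcode',
    value_name='control'
).sort_values(by=['store', 'barcode']).reset_index(drop=True)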
I want to use a lookup table of events to populate a new column in a dataset.
I have a data frame that holds a batchcode list indexed by date and tank, where date is the date the batch was allocated. These dates change intermittently, whenever fish are moved between tanks.
batchcode
date tank
2016-01-02 LD TRE-20160102-A-1
LA TRE-20160102-B-1
2016-01-09 T8 TRE-20160109-C-1
LB TRE-20160109-C-1
2016-01-25 LA TRE-20160125-D-1
2016-01-27 LD TRE-20160102-A-2
LC TRE-20160102-A-3
2016-02-02 LD TRE-20160102-E-1
LA TRE-20160125-D-2
LB TRE-20160109-C-2
I have a second table that lists daily activities such as feeding, temperature observations etc.
Date Tank Property Value
0 2015-12-06 LC Green Water (g) 50.0
1 2015-12-07 LC Green Water (g) 50.0
2 2015-12-08 LC Green Water (g) 50.0
3 2015-12-09 LC Green Water (g) 50.0
4 2015-12-10 LC Green Water (g) 50.0
I want to add a new column to this second table for batchcode, where the value is the batchcode from the first table. I.e. I need to match on Tank and, for each date, find the batchcode set for that date, that is, the most recent previous entry.
What is the best way to solve this? My initial solution ends up running a function for every row of table 2, which seems inefficient. I feel that I should be able to get table 1 to work as a simple indexed lookup:
df2['batchcode'] = df1.find(df2['date', 'tank'], 'batchcode', method='pad') # pseudocode
Should I try to convert the tanks to columns? How do I find the most recent preceding date in the index?
Loading code:
bcr_file = 'data/batch_code_records.csv'
bcr_df = pd.read_csv(bcr_file, dtype='str')
bcr_df['date'] = pd.to_datetime(bcr_df['change_date'])
bcr_df.drop(['unique_id', 'species', 'parent_indicator',
'change_date', 'entered_by', 'fish_class', 'origin_parents_wild_breed',
'change_reason_1', 'change_reason_2', 'batch_new', 'comments',
'fish_count', 'source_population_or_wild', 'source_batch_1',
'source_batch_2', 'batch_previous_1', 'batch_previous_2', 'tank_from_1',
'tank_from_2'], axis=1, inplace=True)
bcr_df.set_index(['date', 'tank'], inplace=True)
Source data file
unique_id,batchcode,tank,species,parent_indicator,change_date,entered_by,fish_class,origin_parents_wild_breed,change_reason_1,change_reason_2,batch_new,comments,fish_count,source_population_or_wild,source_batch_1,source_batch_2,batch_previous_1,batch_previous_2,tank_from_1,tank_from_2
TRE-20160102-A-1-LD-20160102,TRE-20160102-A-1,LD,TRE,A,20160102,andrew.watkins#plantandfood.co.nz,WILD OLD,WILD,unstocked_source_to_empty_destination,,,"tank_move_comment: 26905 stock to LA, 26905 LA to LD",26905.0,WILD,,,,,,
TRE-20160102-B-1-LA-20160102,TRE-20160102-B-1,LA,TRE,B,20160102,andrew.watkins#plantandfood.co.nz,WILD OLD,WILD,new_from_wild,,,"tank_move_comment: 26905 stock to LA, 26905 LA to LD",26905.0,WILD,,,,,,
etc.
I tried using get_indexer but couldn't get it to work with the two-level index, and with only date as the index I get non-unique date warnings.
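One possible sketch for this kind of "most recent previous entry" lookup is pd.merge_asof with left_by/right_by to keep the match within the same tank. The two small frames below are hypothetical stand-ins for bcr_df (with reset_index() applied so date and tank are ordinary columns) and the activities table; the column names follow the question.

import pandas as pd

# Hypothetical stand-ins for the real tables
df1 = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-02', '2016-01-02', '2016-01-09', '2016-01-25', '2016-01-27']),
    'tank': ['LD', 'LA', 'LB', 'LA', 'LD'],
    'batchcode': ['TRE-20160102-A-1', 'TRE-20160102-B-1', 'TRE-20160109-C-1',
                  'TRE-20160125-D-1', 'TRE-20160102-A-2'],
})
df2 = pd.DataFrame({
    'Date': pd.to_datetime(['2016-01-05', '2016-01-10', '2016-01-28']),
    'Tank': ['LD', 'LB', 'LA'],
    'Property': ['Green Water (g)'] * 3,
    'Value': [50.0, 50.0, 50.0],
})

# merge_asof needs both frames sorted on the date keys; direction='backward' (the default)
# picks the most recent entry at or before each activity date, within the same tank.
out = pd.merge_asof(
    df2.sort_values('Date'),
    df1.sort_values('date'),
    left_on='Date', right_on='date',
    left_by='Tank', right_by='tank',
    direction='backward',
).drop(columns=['date', 'tank'])
print(out)

This avoids calling a Python function for every row of table 2; the asof join is done in one vectorised pass.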
Here is my xlsx file: /we.tl/t-ghXIOjPznq
https://imgur.com/b8kTbNV
I have such a dataframe. I want to select only the rows where the LITHOLOGY column is 1. In order to do that:
df2 = pd.read_excel('V131BLOG.xlsx')
LITHOLOGY = [1]
df2[df2.LITHOLOGY.isin(LITHOLOGY)]
There hasn't been a problem so far. I was able to filter as I wanted.
https://imgur.com/wcSvokM
In addition to this, I want to keep the cells where LITHOLOGY is 1 only if the layer's thickness is bigger than 15 cm. What I mean is that the cumulative difference of consecutive DEPTH_MD cells within such a group should be bigger than 10 cm. I have not made any progress on this. What path should I follow?
As you can see in this figure (https://imgur.com/a/02nlUUl), there are consecutive runs where LITHOLOGY is 1. But when you check the DEPTH_MD values, the upper group spans 10 cm while the lower group spans only 5 cm. I want to create a dataframe that only contains the groups whose DEPTH_MD span is bigger than 10 cm.
Input:
DEPTH_MD CALIPER GR LITHOLOGY SHALLOW DEEP
1980 329.00 26.8964 25.47160 2 2.99103 2.62130
1981 329.05 26.8574 32.54390 2 2.94772 2.58945
1982 329.10 27.1297 28.83750 1 2.90123 2.55601
1983 329.15 26.9742 17.91150 2 2.80383 2.52327
1984 329.20 28.3946 31.94310 2 2.76041 2.49050
1985 329.25 30.9402 17.63760 1 2.71992 2.46051
1986 329.30 35.2419 17.69170 1 2.67355 2.42852
1987 329.35 37.9206 17.74620 1 2.61838 2.33619
1988 329.40 39.9189 24.84460 2 2.56200 2.28671
1989 329.45 41.4947 7.03354 2 2.50669 2.23887
1990 329.50 41.5473 7.03354 2 2.42167 2.19944
1991 329.55 41.0158 10.58260 2 2.40039 2.17235
Expected output:
DEPTH_MD CALIPER GR LITHOLOGY SHALLOW DEEP
1985 329.25 30.9402 17.6376 1 2.71992 2.46051
1986 329.30 35.2419 17.6917 1 2.67355 2.42852
1987 329.35 37.9206 17.7462 1 2.61838 2.33619
Group the consecutive 'LITHOLOGY' rows, then compute each group's thickness and finally broadcast it back to all rows:
df['THICKNESS'] = (
df.groupby(df['LITHOLOGY'].ne(df['LITHOLOGY'].shift()).cumsum())['DEPTH_MD']
.transform(lambda x: x.diff().sum())
)
out = df[(df['LITHOLOGY'] == 1) & (df['THICKNESS'] >= 0.1)]
Output:
>>> out
DEPTH_MD CALIPER GR LITHOLOGY SHALLOW DEEP THICKNESS
1985 329.25 30.9402 17.6376 1 2.71992 2.46051 0.1
1986 329.30 35.2419 17.6917 1 2.67355 2.42852 0.1
1987 329.35 37.9206 17.7462 1 2.61838 2.33619 0.1
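For reference, the grouping key used above can be inspected on its own; a new group id starts every time LITHOLOGY changes from the previous row:

groups = df['LITHOLOGY'].ne(df['LITHOLOGY'].shift()).cumsum()
# For the sample input this gives 1, 1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5,
# so rows 1985-1987 form group 4 and their DEPTH_MD diffs sum to 0.10.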
This question already has answers here: Pandas Merging 101 (8 answers). Closed 2 years ago.
I am unable to compare the column values of two different dataframes.
The first dataset has 500 rows and the second has 128 rows. I am showing a few rows of each below.
First dataset:
Country_name Weather President
USA 16 Trump
China 19 Xi
Second dataset:
Country_name Weather Currency
North Korea 26 NKT
China 19 Yaun
I want to compare the Country_name column because I don't have a Currency column in dataset 1, so if the country_name matches, I can append its value. My final dataframe should look like this:
Country_name Weather President Currency
USA 16 Trump Dollar
China 19 Xi Yaun
In the final dataframe above, we have to include only those countries whose Country_name is present in both datasets, and the corresponding value of Currency should be appended as shown.
If you just want to keep records that match in Country_name, and exclude everything else, you can use the merge function, which basically finds the intersection between two dataframes based on some given column:
d1 = pd.DataFrame(data=[['USA', 16, 'Trump'], ['China', 19, 'Xi']],
                  columns=['Country_name', 'Weather', 'President'])
d2 = pd.DataFrame(data=[['North Korea', 26, 'NKT'], ['China', 19, 'Yun']],
                  columns=['Country_name', 'Weather', 'Currency'])
result = pd.merge(d1, d2, on=['Country_name'], how='inner')\
    .rename(columns={'Weather_x': 'Weather'}).drop(['Weather_y'], axis=1)
print(result)
Output
Country_name Weather President Currency
0 China 19 Xi Yun
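As an alternative sketch, if Weather is expected to be identical in both frames for a matching country, merging on both columns avoids the _x/_y suffix cleanup:

result = pd.merge(d1, d2, on=['Country_name', 'Weather'], how='inner')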
My problem:
I have this DF:
df_problem = pd.DataFrame({"Share":['5%','6%','9%','9%', '9%'],"level_1":[0,0,1,2,3], 'BO':['Nestle', 'Procter', 'Nestle', 'Tesla', 'Jeff']})
The problem is that the 9% is actually divided among the three shareholders. So I want to give each of them their 3% share and attach it to their name. It should then look like this:
df_solution = pd.DataFrame({"Share":['5%','6%','3%','3%', '3%'],"level_1":[0,0,0,1,2], 'BO': ['Nestle', 'Procter', 'Nestle', 'Tesla', 'Jeff']})
How do I do this in a simple way?
You could try something like this:
df_problem['Share'] = (df_problem['Share'].str.replace('%', '').astype(float) /
                       df_problem.groupby('Share')['BO'].transform('count')).astype(str) + '%'
>>> df_problem
Share level_1 BO
0 5.0% 0 Nestle
1 6.0% 0 Procter
2 3.0% 1 Nestle
3 3.0% 2 Tesla
4 3.0% 3 Jeff
Please note that I have assumed the values of the 'Share' column to be floats, as you can see above.
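If the level_1 column in df_solution is meant to restart within each new Share value (as it does in the example), a cumcount on the updated column would reproduce it, though this is an assumption about what level_1 represents:

df_problem['level_1'] = df_problem.groupby('Share').cumcount()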
I have a dataframe of tweet data that is originally like this:
lang long lat hashtag country
1 it -69.940500 18.486700 DaddyYankeeAlertaRoja DO
2 it -69.940500 18.486700 QueremosConciertoDeAURA DO
3 it -69.940500 18.486700 LoQueDiceLaFoto DO
4 sv 14.167014 56.041735 escSE S
I have converted it into count information sorted by country and hashtag via:
d = pd.DataFrame({'count' : all_tweets.groupby(['country', 'hashtag']).size()}).reset_index()
d=
country hashtag count
0 A 100DaysofJapaneseLettering 3
1 A 100happydays 1
2 A 10cities1backpack 2
3 A 12points 6
... ... ... ...
848857 ZW reflections 1
848858 ZW saveKBD 1
848859 ZW sekuru 1
848860 ZW selfie 2
I ultimately want to plot the top hashtag per country. How do I take the max count for each country in the df and plot it?
I realized this question was a bit of a duplicate of Extract row with maximum value in a group pandas dataframe.
I extracted the most popular hashtag per country with this command:
top = d.loc[d.groupby(['country']).apply(lambda x: x['count'].idxmax())]
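A minimal plotting sketch, assuming the extracted rows are in a frame called top as above: one bar per country, sized by the count of its most popular hashtag and labelled with the hashtag name (practical only for a manageable number of countries).

import matplotlib.pyplot as plt

top = top.sort_values('count', ascending=False)
ax = top.plot.bar(x='country', y='count', legend=False, figsize=(12, 4))
ax.set_ylabel('tweet count of top hashtag')
# annotate each bar with its hashtag
for patch, tag in zip(ax.patches, top['hashtag']):
    ax.annotate(tag, (patch.get_x(), patch.get_height()), rotation=90, fontsize=8)
plt.tight_layout()
plt.show()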