SUMIF MATCH Conditional in Python - indexing

I'm trying to do the equivalent of Excel's MATCH, INDEX and SUMIF in Python.
Example dataset:
A - 1
B - 2
A - 5
C - 6
A - 1
C - 2
So if I need to sum A, it should output 7.
The issue is that I don't know the key A in advance - I need some universal approach that captures every key, so the output reads A = 7, B = 2, C = 8.
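A minimal sketch of one way to do this in pandas (assuming the data sits in two columns, here called key and value; the groupby does the "universal variable" part so no key is ever named explicitly):

import pandas as pd

# example data from the question
df = pd.DataFrame({'key': ['A', 'B', 'A', 'C', 'A', 'C'],
                   'value': [1, 2, 5, 6, 1, 2]})

# sum the values per key without naming any specific key
totals = df.groupby('key')['value'].sum()
print(totals)  # A 7, B 2, C 8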

Related

Python altair - facet line plot with multiple variables

I have the following kind of DataFrame
Marque Annee Modele PVFP PM
0 A 1 Python 70783.066836 2.067821e+07
1 A 2 Python 75504.270716 1.957717e+07
2 A 3 Python 66383.237169 1.848982e+07
3 A 4 Python 61966.851675 1.755261e+07
4 A 5 Python 54516.367597 1.671907e+07
5 A 1 Sol 66400.686091 2.067821e+07
6 A 2 Sol 74953.770294 1.955218e+07
7 A 3 Sol 66500.916446 1.844078e+07
8 A 4 Sol 62016.941237 1.748098e+07
9 A 5 Sol 54356.008414 1.662684e+07
10 B 1 Python 43152.461787 1.340989e+07
11 B 2 Python 62397.794144 1.494418e+07
12 B 3 Python 1871.135251 2.178552e+06
I tried to build a facet chart but without really succeeding; I'm only able to concatenate the two generated charts vertically. I would be grateful for any idea on how to do this properly in one operation.
My current code:
chart = alt.Chart(euro).mark_line().encode(
    x='Annee',
    y='PVFP',
    color='Modele'
).properties(
    width=150,
    height=150
).facet(
    facet='Marque',
    columns=3
)

chart2 = alt.Chart(euro).mark_line().encode(
    x='Annee',
    y='PM',
    color='Modele'
).properties(
    width=150,
    height=150
).facet(
    facet='Marque',
    columns=3
)

chart & chart2
One good way to do this is to use a Fold Transform to fold your two columns into one, and then you can use row and column facets to facet by both variables at once. For example:
alt.Chart(euro).transform_fold(
    ['PVFP', 'PM'], as_=['key', 'value']
).mark_line().encode(
    x='Annee:Q',
    y='value:Q',
    color='Modele:N'
).properties(
    width=150,
    height=150
).facet(
    column='Marque:N',
    row='key:N'
)
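One optional tweak on top of that answer: since PVFP and PM differ by several orders of magnitude in this data, you may want each facet row to get its own y-axis. Adding a resolve_scale call to the faceted chart should do it (a sketch, not part of the original answer):

alt.Chart(euro).transform_fold(
    ['PVFP', 'PM'], as_=['key', 'value']
).mark_line().encode(
    x='Annee:Q',
    y='value:Q',
    color='Modele:N'
).properties(
    width=150,
    height=150
).facet(
    column='Marque:N',
    row='key:N'
).resolve_scale(y='independent')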

Calculating the difference between values based on their date

I have a dataframe that looks like this, where the "Date" is set as the index
A B C D E
Date
1999-01-01 1 2 3 4 5
1999-01-02 1 2 3 4 5
1999-01-03 1 2 3 4 5
1999-01-04 1 2 3 4 5
I'm trying to compare the percent difference between two pairs of dates. I think I can do the first bit:
start_1 = "1999-01-02"
end_1 = "1999-01-03"
start_2 = "1999-01-03"
end_2 = "1999-01-04"
Obs_1 = df.loc[end_1] / df.loc[start_1] - 1
Obs_2 = df.loc[end_2] / df.loc[start_2] - 1
The output I get from - eg Obs_1 looks like this:
A 0.011197
B 0.007933
C 0.012850
D 0.016678
E 0.007330
dtype: float64
I'm looking to build some correlations between Obs_1 and Obs_2. I think I need to create a new dataframe with the labels A-E as one column (or as the index), and then the data series from Obs_1 and Obs_2 as adjacent columns.
But I'm struggling! I can't 'see' what Obs_1 and Obs_2 'are' - have I created a list? A series? How can I tell? What would be the best way of combining the two into a single dataframe...say df_1.
I'm sure the answer is staring me in the face but I'm going mental trying to figure it out...and because I'm not quite sure what Obs_1 and Obs_2 'are', it's hard to search the SO archive to help me.
Thanks in advance
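For what it's worth, a minimal sketch of one way to inspect and combine them, building on the Obs_1/Obs_2 lines above (the keys passed to concat are just illustrative column names):

import pandas as pd

# dividing one row of the DataFrame by another returns a pandas Series indexed by A-E
print(type(Obs_1))  # <class 'pandas.core.series.Series'>

# line the two Series up side by side as columns of one DataFrame
df_1 = pd.concat([Obs_1, Obs_2], axis=1, keys=['Obs_1', 'Obs_2'])
print(df_1.corr())  # pairwise correlation between the two observations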

Can I use pandas to create a biased sample?

My code uses a column called booking status that is 1 for yes and 0 for no (there are multiple other columns that information will be pulled from, depending on the booking status). There are many more no's than yes's, so I would like to take a sample containing all the yes rows and an equal number of no rows.
When I use
samp = rslt_df.sample(n=298, random_state=1, weights='bookingstatus')
I get the error:
ValueError: Fewer non-zero entries in p than size
Is there a way to do the sampling this way?
If our entire dataset looks like this:
print(df)
c1 c2
0 1 1
1 0 2
2 0 3
3 0 4
4 0 5
5 0 6
6 0 7
7 1 8
8 0 9
9 0 10
We may decide to sample from it using the DataFrame.sample function. By default, this function samples without replacement, meaning you'll receive an error if you specify a number of observations larger than the number of observations in your initial dataset:
df.sample(20)
ValueError: Cannot take a larger sample than population when 'replace=False'
In your situation, the ValueError comes from the weights parameter:
df.sample(3,weights='c1')
ValueError: Fewer non-zero entries in p than size
To paraphrase the DataFrame.sample docs, using the c1 column as our weights parameter implies that rows with a larger value in the c1 column are more likely to be sampled. Specifically, the sample function will never pick a row whose weight in this column is zero. We can fix this error using either one of the following methods.
Method 1: Set the replace parameter to be true:
m1 = df.sample(3,weights='c1', replace=True)
print(m1)
c1 c2
0 1 1
7 1 8
0 1 1
Method 2: Make sure the n parameter is equal to or less than the number of 1s in the c1 column:
m2 = df.sample(2,weights='c1')
print(m2)
c1 c2
7 1 8
0 1 1
If you decide to use this method, you won't really be sampling. You're really just filtering out any rows where the value of c1 is 0.
I was able to do this in the end; here is how I did it:
bookingstatus_count = df.bookingstatus.value_counts()
print('Class 0:', bookingstatus_count[0])
print('Class 1:', bookingstatus_count[1])
print('Proportion:', round(bookingstatus_count[0] / bookingstatus_count[1], 2), ': 1')

# Class count
count_class_0, count_class_1 = df.bookingstatus.value_counts()

# Divide by class
df_class_0 = df[df['bookingstatus'] == 0]
df_class_1 = df[df['bookingstatus'] == 1]

# Under-sample the majority class down to the size of the minority class
df_class_0_under = df_class_0.sample(count_class_1)
df_test_under = pd.concat([df_class_0_under, df_class_1], axis=0)
Based on https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
Thanks everyone

Extracting a word from string from n rows and append that word as a new col in SQL Server

I have a data set that contains 3 columns and 15565 observations. One of the columns has several words in the same row.
What I am looking to do is extract a particular word from each row and append it to a new column (I will have 4 columns in total).
The problem is that the words I am looking for are not all the same and they are not always in the same position.
Here is an extract of my DS:
x y z
-----------------------------------------------------------------------
1 T 3C00652722 (T558799A)
2 T NA >> MSP: T0578836A & 3C03024632
3 T T0579010A, 3C03051500, EAET03051496
4 U T0023231A > MSP: T0577506A & 3C02808556
8 U (T561041A C72/59460)>POPMigr.T576447A,C72/221816*3C00721502
I am looking to extract every word that starts with 3C and is 10 characters long, and then append it to a new column so it looks like this:
x y z Ref
----------------------------------------------------------------
1 T 3C00652722 (T558799A) 3C00652722
2 T NA >> MSP: T0578836A & 3C03024632 3C03024632
3 T T0579010A, 3C03051500, EAET03051496 3C03051500
4 U T0023231A > MSP: T0577506A & 3C02808556 3C02808556
8 U >POPMigr.T576447A,C72/221816*3C00721502 3C00721502
I have tried using the Contains, Like and substring methods, but they do not give me the results I am looking for: they basically find the rows that contain the 3C number but do not extract it, they just copy the whole cell into the Ref column.
SQL Server doesn't have good string functions, but this should suffice if you only want to extract one value per row:
select t.*,
       left(stuff(col,
                  1,
                  patindex('%3C[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', col) - 1,
                  ''),
            10) as Ref
from t;

change specific data of column

I have a table with 10 records. I want to update part of a column's data: for example, in row 1 I want to change "std" to "standard" while the rest of the value stays the same, and do the same thing in every row in a single query. Is that possible? Note that we can't delete and re-insert the rows, because that would change the id.
id - col1 - col2
1 - A - std abcad
2 - B - std bcddsad
3 - C - std avadsad
4 - A - std abcdsad
5 - B - std bcddsa
6 - C - std avadsad
7 - A - std abcdsd
8 - B - std bcddsds
9 - C - std avadsd
You can use the REPLACE function for this:
UPDATE table
SET col2 = REPLACE(col2, 'std', 'standard');
Alternatively, if the column is declared as varchar(max) or nvarchar(max), you can overwrite just that part of the value in place with .WRITE:
UPDATE tblName
SET Column.WRITE('Standard', CHARINDEX('std', Column, 1) - 1, LEN('std'));