I have a DataFrame df_trg with, say, 10 rows indexed 0-9.
From various sources I get values for an additional column foo, each source covering only a subset of the rows, e.g. S1 has 0-3, 7, 9 and S2 has 4, 6.
I would like to end up with a data frame with a single new column foo, where rows not covered by any source remain NaN.
Is there a "nicer" way other than:
df_trg['foo'] = np.nan
for src in sources:
df_trg['foo'][df_trg.index.isin(src.index)] = src
for example, using join or merge?
Let's create the source DataFrame (df), s1 and s2 (Series objects with
updating data) and a list of them (sources):
df = pd.DataFrame(np.arange(1, 51).reshape((5, -1)).T)
s1 = pd.Series([11, 12, 13, 14, 15, 16], index=[0, 1, 2, 3, 7, 9])
s2 = pd.Series([27, 28], index=[4, 6])
sources = [s1, s2]
Start by adding a foo column, initially filled with
an empty string:
df = df.assign(foo='')
Then run the following "updating" loop:
for src in sources:
df.foo.update(other=src)
The result is:
0 1 2 3 4 foo
0 1 11 21 31 41 11
1 2 12 22 32 42 12
2 3 13 23 33 43 13
3 4 14 24 34 44 14
4 5 15 25 35 45 27
5 6 16 26 36 46
6 7 17 27 37 47 28
7 8 18 28 38 48 15
8 9 19 29 39 49
9 10 20 30 40 50 16
In my opinion, this solution is (at least a little) nicer than yours and
shorter.
Alternative: fill the foo column with NaN initially; in that case the
updated values will be converted to float (a side effect of using NaN).
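Since the question asks for a loop-free route, here is one possible sketch that relies on index alignment instead of update. It reuses the df, s1 and s2 from above and assumes the source indexes do not overlap (assigning a Series with duplicate labels would fail to align).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 51).reshape((5, -1)).T)
s1 = pd.Series([11, 12, 13, 14, 15, 16], index=[0, 1, 2, 3, 7, 9])
s2 = pd.Series([27, 28], index=[4, 6])
sources = [s1, s2]

# Stack all sources into one Series; assumes their indexes are disjoint.
foo = pd.concat(sources)

# Column assignment aligns on the index, so rows missing from every
# source (5 and 8 here) are left as NaN and the column becomes float.
df["foo"] = foo
```

If the sources may overlap, the update loop above is the safer choice, since later sources simply overwrite earlier ones.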
I want to drop all rows in a pandas data frame whose value in one column (recharge_number) is duplicated while the value in another column (account) is not, i.e. rows where the same recharge_number appears under different accounts. An illustrative example:
data = {'account': [43,43,43,43,45,45],
'recharge_number': [17777, 17777, 17999, 17888, 17222, 17999] ,
'year': [2021,2021,2021,2021,2020,2020],
'month': [2,3,5,6,2,9]}
account recharge_number year month
43 17777 2021 2
43 17777 2021 3
43 17999 2021 5
43 17888 2021 6
45 17222 2020 2
45 17999 2020 9
input data
output:
account recharge_number year month
43 17777 2021 2
43 17777 2021 3
43 17888 2021 6
45 17222 2020 2
output data
Another method is to drop rows instead of keeping them:
>>> df.drop(df[~df.duplicated(['id', 'number'], keep=False)
& df.duplicated('number', keep=False)].index)
id number
0 5 10
1 5 10
3 6 20
5 7 40
The first condition protects all duplicated ('id', 'number') records. The second condition selects all records whose 'number' occurs more than once, so only those that are not protected are dropped.
Basically, you want "the full row (or the two columns, if the dataframe is larger) is duplicated" or "number is not duplicated".
You can use duplicated:
df[df[['id', 'number']].duplicated(keep=False)|~df['number'].duplicated(keep=False)]
Output:
id number
0 5 10
1 5 10
3 6 20
5 7 40
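The answers above use 'id' and 'number' as column names. As a sketch, here are the same two masks applied to the data from the question, with the column names swapped for account and recharge_number (that mapping is my assumption).

```python
import pandas as pd

data = {'account': [43, 43, 43, 43, 45, 45],
        'recharge_number': [17777, 17777, 17999, 17888, 17222, 17999],
        'year': [2021, 2021, 2021, 2021, 2020, 2020],
        'month': [2, 3, 5, 6, 2, 9]}
df = pd.DataFrame(data)

# True where the (account, recharge_number) pair occurs more than once.
pair_dup = df.duplicated(['account', 'recharge_number'], keep=False)
# True where the recharge_number alone occurs more than once.
number_dup = df.duplicated('recharge_number', keep=False)

# Keep rows whose pair is repeated, or whose number is unique overall.
result = df[pair_dup | ~number_dup]
```

Note that a pair repeated within one account is kept even if the same number also appears under another account; whether that is desired depends on the data.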
Solution with .crosstab:
mask = pd.crosstab(df["account"], df["recharge_number"]).ne(0).sum().gt(1)
print(df[~df["recharge_number"].isin(mask[mask].index)])
Prints:
account recharge_number year month
0 43 17777 2021 2
1 43 17777 2021 3
3 43 17888 2021 6
4 45 17222 2020 2
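A possible alternative to the crosstab, sketched under the assumption that the rule is "drop every recharge_number that appears under more than one account": count the distinct accounts per number with groupby and transform.

```python
import pandas as pd

df = pd.DataFrame({'account': [43, 43, 43, 43, 45, 45],
                   'recharge_number': [17777, 17777, 17999, 17888, 17222, 17999],
                   'year': [2021, 2021, 2021, 2021, 2020, 2020],
                   'month': [2, 3, 5, 6, 2, 9]})

# For each row, the number of distinct accounts sharing its recharge_number.
n_accounts = df.groupby('recharge_number')['account'].transform('nunique')

# Keep only numbers that belong to exactly one account.
result = df[n_accounts.eq(1)]
```

This avoids building the full account-by-number table, which may matter when there are many distinct values.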
I have a dataframe like this:
df = pd.DataFrame(np.random.randint(50, size=(4, 4)),
index=[['a', 'a', 'b', 'b'], [800, 900, 800, 900]],
columns=['X', 'Y', 'r_value', 'z_value'])
df.index.names = ["dat", "recor"]
X Y r_value z_value
dat recor
a 800 14 28 12 18
900 47 34 59 49
b 800 33 18 24 33
900 18 25 44 19
...
I want to apply a function to create a new column based on r_value that gives values only for the case of recor==900, so, in the end I would like something like:
X Y r_value z_value BB
dat recor
a 800 14 28 12 18 NaN
900 47 34 59 49 0
b 800 33 18 24 33 NaN
900 18 25 44 19 2
...
I have created the function like:
x = df.loc[pd.IndexSlice[:,900], "r_value"]
conditions = [x >=70, np.logical_and(x >= 40, x < 70), \
np.logical_and(x >= 10, x < 40), x <10]
choices = [0, 1, 2, 3]
BB = np.select(conditions, choices)
So now I need to append BB as a column, filling with NaNs the rows corresponding to recor==800. How can I do it? I have tried a couple of ideas (not commented here) without result. Thx.
Try
df.loc[df.index.get_level_values('recor')==900, 'BB'] = BB
The part df.index.get_level_values('recor')==900 creates a boolean array that is True where the index level "recor" equals 900.
Indexing with a column that does not exist yet, i.e. "BB", creates that new column.
The rest of the column should automatically be filled with NaN.
I can't test it since you didn't include a minimal reproducible example.
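A minimal sketch for checking the suggestion, using the fixed values printed in the question instead of random numbers. The BB values in the question's desired output look illustrative, so the result here need not match them.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'b'], [800, 900]],
                                   names=['dat', 'recor'])
df = pd.DataFrame([[14, 28, 12, 18],
                   [47, 34, 59, 49],
                   [33, 18, 24, 33],
                   [18, 25, 44, 19]],
                  index=index, columns=['X', 'Y', 'r_value', 'z_value'])

# Boolean mask over rows: True where the "recor" level equals 900.
mask = df.index.get_level_values('recor') == 900
x = df.loc[mask, 'r_value']

conditions = [x >= 70, (x >= 40) & (x < 70), (x >= 10) & (x < 40), x < 10]
choices = [0, 1, 2, 3]

# Assigning to a new column through the mask leaves other rows as NaN.
df.loc[mask, 'BB'] = np.select(conditions, choices)
```

Because of the NaN rows, the new column is stored as float rather than int.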
I am trying to sum all values in a range of columns, from the third to the last of several thousand columns, using:
day3prep['D3counts'] = day3prep.sum(day3prep.iloc[:, 2:].sum(axis=1))
The dataframe is formatted as:
ID G1 Z1 Z2 ...ZN
0 50 13 12 ...62
1 51 62 23 ...19
dataframe with summed column:
ID G1 Z1 Z2 ...ZN D3counts
0 50 13 12 ...62 sum(Z1:ZN in row 0)
1 51 62 23 ...19 sum(Z1:ZN in row 1)
I've changed the NaNs to 0's. The datatype is float but I am getting the error:
'Series' objects are mutable, thus they cannot be hashed
You only need this part:
day3prep['D3counts'] = day3prep.iloc[:, 2:].sum(axis=1)
With some random numbers:
import pandas as pd
import random
random.seed(42)
day3prep = pd.DataFrame({'ID': random.sample(range(10), 5), 'G1': random.sample(range(10), 5),
'Z1': random.sample(range(10), 5), 'Z2': random.sample(range(10), 5), 'Z3': random.sample(range(10), 5)})
day3prep['D3counts'] = day3prep.iloc[:, 2:].sum(axis=1)
Output:
> day3prep
ID G1 Z1 Z2 Z3 D3counts
0 1 2 0 8 8 16
1 0 1 9 0 6 15
2 4 8 1 3 3 7
3 9 4 7 5 7 19
4 6 3 6 6 4 16
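If the position of the Z columns is not guaranteed, selecting them by name may be more robust than iloc. A small sketch, assuming the columns to sum are exactly those named Z followed by digits; the two rows are taken from the question, with Z3 standing in for ZN.

```python
import pandas as pd

day3prep = pd.DataFrame({'ID': [0, 1], 'G1': [50, 51],
                         'Z1': [13, 62], 'Z2': [12, 23], 'Z3': [62, 19]})

# Pick the columns whose name is "Z" followed by digits only.
z_cols = day3prep.filter(regex=r'^Z\d+$')

# Row-wise sum over the selected columns.
day3prep['D3counts'] = z_cols.sum(axis=1)
```

Selecting first and summing second also keeps the new D3counts column out of the sum if the line is run again.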
I have a BigQuery table with 512 array-valued columns with quite long names (x__x_arrVal_arrSlices_0__arrValues to arrSlices_511). Each array holds 360 values. The BI tool cannot process arrays in this form, which is why I want each value as a separate output column.
The query excerpt I use right now is:
SELECT
timestamp, x_stArrayTag_sArrayName, x_stArrayTag_sComission,
1 as row,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(1)] AS f001,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(10)] AS f010,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(20)] AS f020,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(30)] AS f030,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(40)] AS f040,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(50)] AS f050,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(60)] AS f060,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(70)] AS f070,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(80)] AS f080,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(90)] AS f090,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(100)] AS f100,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(110)] AS f110,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(120)] AS f120,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(130)] AS f130,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(140)] AS f140,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(150)] AS f150,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(160)] AS f160,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(170)] AS f170,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(180)] AS f180,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(190)] AS f190,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(200)] AS f200,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(210)] AS f210,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(220)] AS f220,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(230)] AS f230,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(240)] AS f240,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(250)] AS f250,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(260)] AS f260,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(270)] AS f270,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(280)] AS f280,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(290)] as f290,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(300)] AS f300,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(310)] AS f310,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(320)] AS f320,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(330)] AS f330,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(340)] AS f340,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(350)] AS f350,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(359)] AS f359
FROM
`project.table`
WHERE
_PARTITIONTIME >= "2017-01-01 00:00:00"
AND _PARTITIONTIME < "2018-02-16 00:00:00"
UNION ALL
Unfortunately, the output I get is only a fraction of all values. Getting all 512*360 values this way is not possible, because repeating this query for every slice hits BigQuery's limits.
Is there a way to rename the long names and to select a range?
best regards
scotti
You can get 360 rows and 512 columns by using UNNEST. Here is a small example:
WITH data AS (
SELECT
[1, 2, 3, 4] as a,
[2, 3, 4, 5] as b,
[3, 4, 5, 6] as c
)
SELECT v1, b[OFFSET(off)] as v2, c[OFFSET(off)] as v3
FROM data, unnest(a) as v1 WITH OFFSET off
Output:
v1 v2 v3
1 2 3
2 3 4
3 4 5
4 5 6
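For readers more used to Python than SQL, the query above can be thought of roughly as follows: unnesting a WITH OFFSET plays the role of enumerate, and the offset is then used to index into the other arrays. This is only an analogy in plain Python, not BigQuery code.

```python
a = [1, 2, 3, 4]
b = [2, 3, 4, 5]
c = [3, 4, 5, 6]

# enumerate(a) yields (off, v1) pairs, like UNNEST(a) AS v1 WITH OFFSET off.
rows = [(v1, b[off], c[off]) for off, v1 in enumerate(a)]

for v1, v2, v3 in rows:
    print(v1, v2, v3)
```

Like the SQL, this assumes all arrays have the same length; a shorter b or c would raise an IndexError here.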
Given the somewhat messy table you are dealing with, an important aspect of any restructuring decision is how practical the query implementing it is.
In your specific case I would recommend fully flattening the data, as below. Each row is transformed into ~180000 rows, each representing one element of one of the arrays in the original row; the slice field holds the array number and pos holds the element's position within that array. The query is generic enough to handle any number or names of slices and any array sizes, and the result is flexible enough to be used in any further processing.
#standardSQL
SELECT
id,
slice,
pos,
value
FROM `project.dataset.messytable` t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"x__x_arrVal_arrSlices_(\d+)":\[.*?\]')) slice WITH OFFSET x
JOIN UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"x__x_arrVal_arrSlices_\d+":\[(.*?)\]')) arr WITH OFFSET y
ON x = y,
UNNEST(SPLIT(arr)) value WITH OFFSET pos
you can test/play with it using below dummy example
#standardSQL
WITH `project.dataset.messytable` AS (
SELECT 1 id,
[ 1, 2, 3, 4, 5] x__x_arrVal_arrSlices_0,
[11, 12, 13, 14, 15] x__x_arrVal_arrSlices_1,
[21, 22, 23, 24, 25] x__x_arrVal_arrSlices_2 UNION ALL
SELECT 2 id,
[ 6, 7, 8, 9, 10] x__x_arrVal_arrSlices_0,
[16, 17, 18, 19, 20] x__x_arrVal_arrSlices_1,
[26, 27, 28, 29, 30] x__x_arrVal_arrSlices_2
)
SELECT
id,
slice,
pos,
value
FROM `project.dataset.messytable` t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"x__x_arrVal_arrSlices_(\d+)":\[.*?\]')) slice WITH OFFSET x
JOIN UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"x__x_arrVal_arrSlices_\d+":\[(.*?)\]')) arr WITH OFFSET y
ON x = y,
UNNEST(SPLIT(arr)) value WITH OFFSET pos
the result is as below
Row id slice pos value
1 1 0 0 1
2 1 0 1 2
3 1 0 2 3
4 1 0 3 4
5 1 0 4 5
6 1 1 0 11
7 1 1 1 12
8 1 1 2 13
9 1 1 3 14
10 1 1 4 15
11 1 2 0 21
12 1 2 1 22
13 1 2 2 23
14 1 2 3 24
15 1 2 4 25
16 2 0 0 6
17 2 0 1 7
18 2 0 2 8
19 2 0 3 9
20 2 0 4 10
21 2 1 0 16
22 2 1 1 17
23 2 1 2 18
24 2 1 3 19
25 2 1 4 20
26 2 2 0 26
27 2 2 1 27
28 2 2 2 28
29 2 2 3 29
30 2 2 4 30
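To see what the two regular expressions extract, here is a rough Python emulation of the query on the first dummy row. It assumes TO_JSON_STRING produces compact JSON without spaces, which json.dumps mimics here through the separators argument; the variable names are mine.

```python
import json
import re

row = {
    "id": 1,
    "x__x_arrVal_arrSlices_0": [1, 2, 3, 4, 5],
    "x__x_arrVal_arrSlices_1": [11, 12, 13, 14, 15],
    "x__x_arrVal_arrSlices_2": [21, 22, 23, 24, 25],
}

# Compact JSON, standing in for TO_JSON_STRING(t).
as_json = json.dumps(row, separators=(",", ":"))

# First pattern captures the slice number, second the array body.
slices = re.findall(r'"x__x_arrVal_arrSlices_(\d+)":\[.*?\]', as_json)
arrays = re.findall(r'"x__x_arrVal_arrSlices_\d+":\[(.*?)\]', as_json)

# Pair them up by position (the JOIN ... ON x = y), then split each array.
flat = [
    (row["id"], slice_no, pos, value)
    for slice_no, arr in zip(slices, arrays)
    for pos, value in enumerate(arr.split(","))
]
```

As in the SQL version, slice and value come out as strings, so a cast is needed before doing arithmetic on them.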
My data frame looks something like this:
Games
0 CAR 20
1 DEN 21
2 TB 31
3 ATL 24
4 SD 27
5 KC 33
6 CIN 23
7 NYJ 22
import pandas as pd
df =pd.read_csv('weekone.txt',)
df.columns=['Games']
I'm trying to put a blank line in between every two elements (teams).
So I want it to look like this:
Games
0 CAR 20
1 DEN 21

2 TB 31
3 ATL 24

4 SD 27
5 KC 33

6 CIN 23
7 NYJ 22
But when I'm using this loop
for i in df2.index:
if (df2.index[i])%2 == 1:
df2.Games[i]=df2.Games[i]+('\n')
else:
df2.Games[i] = df2.Games[i]
I'm getting an output like this:
Games
0 CAR 20
1 DEN 21\n
2 TB 31
3 ATL 24\n
4 SD 27
5 KC 33\n
6 CIN 23
7 NYJ 22\n
What am I doing wrong? Thanks.
you can do it this way:
In [172]: x
Out[172]:
Games
0 CAR 20
1 DEN 21
2 TB 31
3 ATL 24
4 SD 27
5 KC 33
6 CIN 23
7 NYJ 22
In [173]: %paste
empty_line = pd.DataFrame([''], columns=x.columns, index=[''])
rslt = x.loc[:1]
g = x.groupby(x.index//2)
for i in range(1, len(g)):
rslt = pd.concat([rslt, empty_line, g.get_group(i)])  # DataFrame.append was removed in pandas 2.0
## -- End pasted text --
In [174]: rslt
Out[174]:
Games
0 CAR 20
1 DEN 21

2 TB 31
3 ATL 24

4 SD 27
5 KC 33

6 CIN 23
7 NYJ 22
the index's dtype is object now:
In [178]: rslt.index.dtype
Out[178]: dtype('O')
or having -1 as an index for empty lines:
In [175]: %paste
empty_line = pd.DataFrame([''], columns=x.columns, index=[-1])
rslt = x.loc[:1]
g = x.groupby(x.index//2)
for i in range(1, len(g)):
rslt = pd.concat([rslt, empty_line, g.get_group(i)])  # DataFrame.append was removed in pandas 2.0
## -- End pasted text --
In [176]: rslt
Out[176]:
Games
0 CAR 20
1 DEN 21
-1
2 TB 31
3 ATL 24
-1
4 SD 27
5 KC 33
-1
6 CIN 23
7 NYJ 22
index dtype:
In [181]: rslt.index.dtype
Out[181]: dtype('int64')
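The same idea can be sketched with a single pd.concat call at the end, which avoids growing the frame inside the loop. The data is retyped from the question, and the variable names follow the answer above.

```python
import pandas as pd

x = pd.DataFrame({'Games': ['CAR 20', 'DEN 21', 'TB 31', 'ATL 24',
                            'SD 27', 'KC 33', 'CIN 23', 'NYJ 22']})

# One-row frame with an empty label and an empty value.
empty_line = pd.DataFrame({'Games': ['']}, index=[''])

# Collect each pair of rows followed by a blank separator row.
pieces = []
for _, pair in x.groupby(x.index // 2):
    pieces += [pair, empty_line]

# Drop the trailing separator and concatenate once.
rslt = pd.concat(pieces[:-1])
```

As with the first variant above, mixing integer labels with '' turns the index dtype into object.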