Add two elements in a dataframe (based on the index) - pandas

I have a dataframe in which some rows are useless except for one variable. I want to add that variable's value in those rows to the previous row and then delete the useless rows, so that the only useful piece of information they carry is preserved.
More precisely, my dataframe looks something like
|cat1|cat2|var1|var2|
|----|----|----|----|
|A   |x   |1   |2   |
|A   |x   |1   |0   |
|A   |x   |.   |5   |
|A   |y   |1   |2   |
|A   |y   |1   |2   |
|A   |y   |1   |3   |
|A   |y   |.   |6   |
|B   |x   |1   |2   |
|B   |x   |1   |4   |
|B   |x   |1   |2   |
|B   |x   |1   |1   |
|B   |x   |.   |3   |
and I want to get
|cat1|cat2|var1|var2   |
|----|----|----|-------|
|A   |x   |1   |2      |
|A   |x   |1   |5 (5+0)|
|A   |y   |1   |2      |
|A   |y   |1   |2      |
|A   |y   |1   |9 (6+3)|
|B   |x   |1   |2      |
|B   |x   |1   |4      |
|B   |x   |1   |2      |
|B   |x   |1   |4 (3+1)|
I've tried code like
test = df[df['var1'] == '.'].index
for num in test:
    df['var2'][num - 1] = df['var2'][num - 1] + df['var2'][num]
but it doesn't work.
Any help would be appreciated.

For a very readable solution, combine np.where with a check on the shifted var1 column: shift(-1) looks at the next row, so the condition selects the rows whose next row contains a '.'. Where that is the case, add the next row's var2 to the current value; otherwise keep the original value. Afterwards, just drop all the rows with a '.':
import numpy as np

# If the next row's var1 is '.', add that row's var2 to the current one;
# otherwise keep var2 unchanged.
df['var2_new'] = np.where(df['var1'].shift(-1) == '.',
                          df['var2'] + df['var2'].shift(-1),
                          df['var2'])
# Drop the '.' rows (assign the result back if you want to keep it).
df[df['var1'] != '.']
# cat1 cat2 var1 var2 var2_new
#0 A x 1 2 2.0
#1 A x 1 0 5.0
#3 A y 1 2 2.0
#4 A y 1 2 2.0
#5 A y 1 3 9.0
#7 B x 1 2 2.0
#8 B x 1 4 4.0
#9 B x 1 2 2.0
#10 B x 1 1 4.0
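
For completeness, here is a minimal runnable sketch that reproduces the sample data and applies the answer end to end (the DataFrame literal and the result name are assumptions made for illustration; var2 is assumed numeric, so convert it with pd.to_numeric first if it was read as strings):

import numpy as np
import pandas as pd

# Sample data from the question; var1 is kept as strings because of the '.' rows.
df = pd.DataFrame({
    'cat1': list('AAAAAAABBBBB'),
    'cat2': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'x', 'x', 'x', 'x', 'x'],
    'var1': ['1', '1', '.', '1', '1', '1', '.', '1', '1', '1', '1', '.'],
    'var2': [2, 0, 5, 2, 2, 3, 6, 2, 4, 2, 1, 3],
})

# Add the next row's var2 wherever that next row is a '.' row ...
df['var2_new'] = np.where(df['var1'].shift(-1) == '.',
                          df['var2'] + df['var2'].shift(-1),
                          df['var2'])
# ... then drop the '.' rows.
result = df[df['var1'] != '.'].reset_index(drop=True)
print(result)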

Related

Add rows of data to each group in a Spark dataframe

I have this dataframe -
data = [(0,1,1,201505,3),
(1,1,1,201506,5),
(2,1,1,201507,7),
(3,1,1,201508,2),
(4,2,2,201750,3),
(5,2,2,201751,0),
(6,2,2,201752,1),
(7,2,2,201753,1)
]
cols = ['id','item','store','week','sales']
data_df = spark.createDataFrame(data=data,schema=cols)
display(data_df)
What I want is this:
data_new = [(0,1,1,201505,3,0),
(1,1,1,201506,5,0),
(2,1,1,201507,7,0),
(3,1,1,201508,2,0),
(4,1,1,201509,0,0),
(5,1,1,201510,0,0),
(6,1,1,201511,0,0),
(7,1,1,201512,0,0),
(8,2,2,201750,3,0),
(9,2,2,201751,0,0),
(10,2,2,201752,1,0),
(11,2,2,201753,1,0),
(12,2,2,201801,0,0),
(13,2,2,201802,0,0),
(14,2,2,201803,0,0),
(15,2,2,201804,0,0)]
cols_new = ['id','item','store','week','sales','flag',]
data_df_new = spark.createDataFrame(data=data_new,schema=cols_new)
display(data_df_new)
So basically, I want 8 (this could also be 6 or 10) weeks of data for each item/store groupby combination. Wherever the year's 52/53 weeks end, the sequence should continue into the next year's weeks, as shown in the sample. I need this in PySpark, thanks in advance!
See my attempt below. I could have made it shorter, but I felt it should be as explicit as possible, so I didn't chain the solutions. Code below:
import sys
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import expr, sequence, lit, col, row_number, max

spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
# Convert the week of the year to a date
s = data_df.withColumn("week", expr("cast (week as string)")).withColumn('date', F.to_date(F.concat("week", F.lit("6")), "yyyywwu"))
# Put sales and dates in an array per item/store group
s = (s.groupby('item', 'store').agg(F.collect_list('sales').alias('sales'), F.collect_list('date').alias('date'))
     # Create sequence ids covering the required expansion range per group
     .withColumn("id", sequence(lit(0), lit(6)))
     )
# Explode the dataframe back, one row per item/store/id combination
s = s.selectExpr('item', 'store', 'inline(arrays_zip(date,id,sales))')
# Partition window spanning start to end of each item/store combination
w = Window.partitionBy('item', 'store').orderBy('id').rowsBetween(-sys.maxsize, sys.maxsize)
# Partition window per item/store/date combination; its purpose is to aggregate over the null dates as a group
w1 = Window.partitionBy('item', 'store', 'date').orderBy('id').rowsBetween(Window.unboundedPreceding, Window.currentRow)
# Create increment values per item/store combination and carry the last known date forward
s = (s.withColumn('increment', F.when(col('date').isNull(), (row_number().over(w1)) * 7).otherwise(0))
     .withColumn('date1', F.when(col('date').isNull(), max('date').over(w)).otherwise(col('date')))
     )
# Compute the week of year and drop the columns that are no longer needed
s = s.withColumn("weekofyear", expr("weekofyear(date_add(date1, cast(increment as int)))")).drop('date', 'increment', 'date1').na.fill(0)
s.show(truncate=False)
Outcome
+----+-----+---+-----+----------+
|item|store|id |sales|weekofyear|
+----+-----+---+-----+----------+
|1 |1 |0 |3 |5 |
|1 |1 |1 |5 |6 |
|1 |1 |2 |7 |7 |
|1 |1 |3 |2 |8 |
|1 |1 |4 |0 |9 |
|1 |1 |5 |0 |10 |
|1 |1 |6 |0 |11 |
|2 |2 |0 |3 |50 |
|2 |2 |1 |0 |51 |
|2 |2 |2 |1 |52 |
|2 |2 |3 |1 |1 |
|2 |2 |4 |0 |2 |
|2 |2 |5 |0 |3 |
|2 |2 |6 |0 |4 |
+----+-----+---+-----+----------+
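
A hedged alternative sketch, not part of the original answer: build a per-group calendar of week offsets and left-join the sales back on. It reuses the same legacy time-parser setting and week-to-date conversion as above; n_weeks, calendar and result are illustrative names, and flag is simply filled with 0 as in the desired output.

from pyspark.sql import functions as F

n_weeks = 8  # number of weeks wanted per item/store group

with_dates = data_df.withColumn(
    "date", F.to_date(F.concat(F.col("week").cast("string"), F.lit("6")), "yyyywwu"))

calendar = (with_dates.groupBy("item", "store")
            .agg(F.min("date").alias("start"))
            # one row per group and per offset 0..n_weeks-1, anchored at the first week
            .withColumn("id", F.explode(F.sequence(F.lit(0), F.lit(n_weeks - 1))))
            .withColumn("date", F.expr("date_add(start, id * 7)")))

result = (calendar.join(with_dates.select("item", "store", "date", "sales"),
                        ["item", "store", "date"], "left")
          .na.fill({"sales": 0})           # weeks with no sales row get 0
          .withColumn("weekofyear", F.weekofyear("date"))
          .withColumn("flag", F.lit(0))
          .select("item", "store", "id", "sales", "weekofyear", "flag"))
result.show(truncate=False)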

BigQuery - How can I find the closest row to any other given row within the same table?

I have a table which looks like this:
+--------+-------+--------+---------+
|location|address|latitude|longitude|
+--------+-------+--------+---------+
|1       |a      |20      |21       |
|1       |b      |21      |22       |
|1       |c      |23      |24       |
|2       |d      |45      |50       |
|2       |e      |46      |47       |
|2       |f      |40      |45       |
+--------+-------+--------+---------+
I am trying to find which row is the closest (distance wise) to any given row of the table (for each location group) and to return this distance as a new column:
Expected output
+--------+-------+--------+---------+--------+
|location|address|latitude|longitude|distance|
+--------+-------+--------+---------+--------+
|1       |a      |20      |21       |1.41    | <- Closest neighbour is b
|1       |b      |21      |22       |1.41    | <- Closest neighbour is a
|1       |c      |23      |24       |2.82    | <- Closest neighbour is b
|2       |d      |45      |50       |1.41    | <- Closest neighbour is e
|2       |e      |46      |51       |1.41    | <- Closest neighbour is d
|2       |f      |41      |46       |2.82    | <- Closest neighbour is d
+--------+-------+--------+---------+--------+
In the expected output example I've calculated the Euclidean distance, but I actually need the Haversine distance (using ST_DISTANCE in BigQuery); the Euclidean one is just easier to calculate by hand.
This table is a sample of the actual data and contains ~500k rows. I could do a full outer join on the sample, but the full table has ~30m rows, so that approach is not feasible.
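
For what it's worth, the straightforward way to express this per-location nearest-neighbour distance with ST_DISTANCE is a self-join within each location group. The hedged sketch below uses my_table as a placeholder name and assumes address identifies a row within its location; it is quadratic per group, so it illustrates the query shape rather than solving the ~30m-row scale concern.

SELECT
  a.location,
  a.address,
  a.latitude,
  a.longitude,
  -- geodesic distance in metres to the nearest other row of the same location group
  MIN(ST_DISTANCE(ST_GEOGPOINT(a.longitude, a.latitude),
                  ST_GEOGPOINT(b.longitude, b.latitude))) AS distance
FROM my_table AS a
JOIN my_table AS b
  ON a.location = b.location
 AND a.address != b.address
GROUP BY a.location, a.address, a.latitude, a.longitude;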

How to select rows based on exact count of array elements in a different column

Suppose I have a dataframe like this, where B_C is the concatenation of columns B and C, and selected_B_C is an array formed by picking a few B_C values from within the group.
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+
|A |grp_count_A|B |C |B_C |D |selected_B_C |
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+
|1 |6 |30261.41|20091201|30261.41_20091201|99945.83|[30261.41_20091201, 39879.85_20080601]|
|1 |6 |30261.41|20081201|30261.41_20081201|99945.83|[30261.41_20091201, 39879.85_20080601]|
|1 |6 |39879.85|20080601|39879.85_20080601|99945.83|[30261.41_20091201, 39879.85_20080601]|
|1 |6 |69804.42|20080117|69804.42_20080117|99945.83|[30261.41_20091201, 39879.85_20080601]|
|1 |6 |99950.3 |20090301|99950.3_20090301 |99945.83|[30261.41_20091201, 39879.85_20080601]|
|1 |6 |99999.23|20080118|99999.23_20080118|99945.83|[30261.41_20091201, 39879.85_20080601]|
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|
|2 |4 |351378.0|20180620|351378.0_20180620|183600.0|[[76498.0_20150501, 76498.0_20150501]]|
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+
I want to append a column selected that takes the value 1 if, for a given row, the value of B_C is found in selected_B_C, and 0 otherwise, so the final dataframe looks like this.
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+--------+
|A |grp_count_A|B |C |B_C |D |selected_B_C |selected|
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+--------+
|1 |6 |30261.41|20081201|30261.41_20081201|99945.83|[30261.41_20091201, 39879.85_20080601]|0 |
|1 |6 |30261.41|20091201|30261.41_20091201|99945.83|[30261.41_20091201, 39879.85_20080601]|1 |
|1 |6 |39879.85|20080601|39879.85_20080601|99945.83|[30261.41_20091201, 39879.85_20080601]|1 |
|1 |6 |69804.42|20080117|69804.42_20080117|99945.83|[30261.41_20091201, 39879.85_20080601]|0 |
|1 |6 |99950.3 |20090301|99950.3_20090301 |99945.83|[30261.41_20091201, 39879.85_20080601]|0 |
|1 |6 |99999.23|20080118|99999.23_20080118|99945.83|[30261.41_20091201, 39879.85_20080601]|0 |
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|1 |
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|1 |
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|0 |
|2 |4 |351378.0|20180620|351378.0_20180620|183600.0|[[76498.0_20150501, 76498.0_20150501]]|0 |
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+--------+
The tricky part for selected is that only as many rows should get the value 1 as there are occurrences of the value in selected_B_C.
For example, in group 2 there are 3 records with the value 76498.0_20150501 in B_C, but I want only two of them to get selected = 1, because selected_B_C for group 2 contains exactly 2 elements with the value 76498.0_20150501.
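
For illustration, a hedged PySpark sketch of one way to get this exact-count behaviour: number the duplicate B_C rows inside each group and compare that running number with how many times the value occurs in selected_B_C. It assumes selected_B_C is a flat array of strings and Spark 2.4+ for the higher-order filter function; the helper column _rid is introduced here purely for a stable ordering.

from pyspark.sql import Window
from pyspark.sql import functions as F

# Stable per-row id so duplicates can be numbered deterministically.
df = df.withColumn("_rid", F.monotonically_increasing_id())

# 1, 2, 3, ... for the duplicate B_C rows inside each group A.
occurrence = F.row_number().over(Window.partitionBy("A", "B_C").orderBy("_rid"))

# How many times this row's B_C value appears in selected_B_C.
allowed = F.expr("size(filter(selected_B_C, x -> x = B_C))")

df = (df.withColumn("selected", F.when(occurrence <= allowed, 1).otherwise(0))
        .drop("_rid"))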

transform table with duplicate

I'm trying to transform a table containing duplicates into a new table following the model below. It seems impossible because of the duplicates, and I don't see how to do it. Thanks in advance for your help.
Original table:
IDu| ID | Information
1 |A |1
2 |A |2
3 |A |3
4 |A |4
5 |A |5
6 |B |1
7 |B |2
8 |B |3
9 |B |4
10 |C |1
11 |D |1
12 |D |2
13 |D |3
Table to reach:
ID | Resultat/table2 | largest value
A |(1,2,3,4,5) |5
B |(1,2,3,4) |4
C |(1) |1
D |(1,2,3) |3
You can use GROUP_CONCAT
(https://www.w3resource.com/mysql/aggregate-functions-and-grouping/aggregate-functions-and-grouping-group_concat.php):
SELECT
    ID, GROUP_CONCAT(INFORMATION), COUNT(INFORMATION)
FROM
    TABLE
GROUP BY
    ID
A huge thank you, quick and perfect response.
On the other hand, how can I filter to get the greatest value?
This query orders from smallest to largest, but how do I keep only the largest value?
ID | Resultat/table2 | greatest value
A |(1,2,3,4,5) |5
B |(1,2,3,4) |4
C |(1) |1
D |(1,2,3) |3
I tried, but without success
SELECT ID, GROUP_CONCAT(ID1)
FROM tournee_reduite
GROUP BY ID
ORDER BY MAX(ID1) DESC;
another huge thank you
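
A hedged sketch of the follow-up: keep GROUP_CONCAT for the list and add MAX(Information) as the largest value, then order by it. Column and table names follow the thread (swap in ID1 if that is the real column name, as in the attempt), and it assumes Information is numeric, otherwise MAX compares strings.

SELECT
    ID,
    GROUP_CONCAT(Information ORDER BY Information) AS Resultat,
    MAX(Information) AS largest_value
FROM tournee_reduite
GROUP BY ID
ORDER BY largest_value DESC;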

Select rows with different values on different columns

I'm new to SQL, so this took me a long time without being able to figure it out.
My table looks like this:
+------+------+------+
|ID |2016 | 2017 |
+------+------+------+
|1 |A |A |
+------+------+------+
|2 |A |B |
+------+------+------+
|3 |B |B |
+------+------+------+
|4 |B |C |
+------+------+------+
I would like to have only the rows which have changed from 2016 to 2017:
+------+------+------+
|ID |2016 | 2017 |
+------+------+------+
|2 |A |B |
+------+------+------+
|4 |B |C |
+------+------+------+
Could you please help?
select * from mytable where column_2016 <> column_2017
assuming your column labels are column_2016 and column_2017