Trim numbers to n digits in a data frame

> head(Conversion_tbl)
X.Product.Code BEC.Product.Code
1 10121 41
2 10129 111
3 10130 41
4 10190 111
5 10221 41
6 10229 111
> tail(Conversion_tbl)
X.Product.Code BEC.Product.Code
5381 970200 61
5382 970300 61
5383 970400 61
5384 970500 61
5385 970600 61
5386 999999 7
I have this data frame and need to:
1) reduce the numbers in the 2nd variable to their first digit (e.g. 61 -> 6)
2) keep only the first 2 digits of the 1st variable (e.g. 970200 -> 97). Note that the first 561 observations are one digit shorter than the others, so for those I need only their first digit, preceded by a "0" (e.g. 10121 -> 01, 84022 -> 08)
Desired output:
X.Product.Code BEC.Product.Code
1 01 4
5381 97 6
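The two rules above can be sketched in plain Python (rather than R) by treating the codes as strings. This is only an illustration: trim_codes is a hypothetical helper name, and it assumes full-length product codes have 6 digits, so the shorter ones are left-padded with "0" before the first two characters are taken.

```python
def trim_codes(product_code, bec_code):
    """Return (first two digits of the zero-padded product code, first digit of the BEC code)."""
    # Assumption: full-length product codes have 6 digits; shorter ones get a leading "0".
    padded = str(product_code).zfill(6)
    return padded[:2], str(bec_code)[:1]

# Sample rows copied from the head/tail output in the question.
rows = [(10121, 41), (10129, 111), (970200, 61), (999999, 7)]
trimmed = [trim_codes(p, b) for p, b in rows]
print(trimmed)
```

The results are strings, which is what preserves the leading "0" in values such as "01".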

How do I count a range in SQL?

I have data that looks like this:
$ Time : int 0 1 5 8 10 11 15 17 18 20 ...
$ NumOfFlights: int 1 6 144 91 504 15 1256 1 1 578 ...
The Time column is just 24-hour time, from 0 all the way up to 2400.
What I hope to get is:
hour | number of flight
-------------------------------------
1st | 240
2nd | 223
... | ...
24th | 122
Here the 1st hour is from midnight to 1am, the 2nd is from 1am to 2am, and so on until the 24th, which is from 11pm to midnight. The number of flights is just the total of NumOfFlights within each range.
I've tried:
dbGetQuery(conn,"
SELECT
flights.CRSDepTime AS Time,
COUNT(flights.CRSDepTime) AS NumOnTimeFlights
FROM flights
GROUP BY CRSDepTime/60
")
But I realise it can't be done this way. The results that I get will have 40 values for time.
> head
Time NumOnTimeFlights
1 50 6055
2 105 2383
3 133 674
4 200 446
5 245 266
6 310 34
> tail
Time NumOnTimeFlights
35 2045 48136
36 2120 103229
37 2215 15737
38 2245 36416
39 2300 15322
40 2355 8018
If your CRSDepTime column is an integer-encoded time like HHmm, then CRSDepTime/100 will extract the hour (assuming your database performs integer division on integer operands).
SELECT
CRSDepTime/100 AS hh,
COUNT(flights.CRSDepTime) AS NumOnTimeFlights
FROM flights
GROUP BY CRSDepTime/100
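A minimal sketch of the same grouping, run from Python against an in-memory SQLite database, where dividing two integers truncates to an integer. The table contents are a handful of made-up departure times, not the asker's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (CRSDepTime INTEGER)")

# Made-up HHmm departure times for illustration only.
sample_times = [50, 105, 133, 200, 245, 2300, 2355]
conn.executemany("INSERT INTO flights VALUES (?)", [(t,) for t in sample_times])

# Integer division by 100 drops the minutes, leaving the hour (0-23).
rows = conn.execute(
    """
    SELECT CRSDepTime / 100 AS hh,
           COUNT(*) AS NumOnTimeFlights
    FROM flights
    GROUP BY CRSDepTime / 100
    ORDER BY hh
    """
).fetchall()
print(rows)
```

Each distinct hour now yields exactly one row, so a full day of data produces at most 24 groups.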

Display rows where multiple columns are different

I have data that looks like this. Thousands of rows returned, but this is just a sample.
Most days have the same numbers in them, but some do not. Note that ID 1 and 5 have identical numbers every day.
ID  Sunday  Monday  Tuesday  Wednesday  Thursday  Friday  Saturday
1   26      26      26       26         26        26      26
2   44      44      30       30         44        44      44
3   55      55      55       55         80        90      55
4   12      12      43       43         43        43      43
5   36      36      36       36         36        36      36
I'd like to only return rows where the days of the week have different numbers.
In this case, the only IDs returned should be 2, 3 & 4.
What would I want this query to look like?
Thanks!
One idea that should work in most RDBMS (with some syntax tweaks) is the following.
This is SQL Server compatible: pivot the days into rows and count the distinct values and filter accordingly:
select id
from t
cross apply (
    select Count(distinct d)
    from (
        values (sunday),(monday),(tuesday),(wednesday),(thursday),(friday),(saturday)
    ) d(d)
) d(v)
where d.v > 1
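For engines without CROSS APPLY, one possible alternative (a different technique from the query above) is to compare the smallest and largest of the seven values: a row has differing days exactly when they are not equal. The sketch below uses SQLite's multi-argument min() and max() scalar functions from Python, with the sample rows from the question; it assumes no day column is NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER, sunday INTEGER, monday INTEGER, tuesday INTEGER, "
    "wednesday INTEGER, thursday INTEGER, friday INTEGER, saturday INTEGER)"
)
# Sample rows from the question.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 26, 26, 26, 26, 26, 26, 26),
        (2, 44, 44, 30, 30, 44, 44, 44),
        (3, 55, 55, 55, 55, 80, 90, 55),
        (4, 12, 12, 43, 43, 43, 43, 43),
        (5, 36, 36, 36, 36, 36, 36, 36),
    ],
)

# With two or more arguments, SQLite's min()/max() act per row, not as aggregates.
ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT id
        FROM t
        WHERE min(sunday, monday, tuesday, wednesday, thursday, friday, saturday)
           <> max(sunday, monday, tuesday, wednesday, thursday, friday, saturday)
        ORDER BY id
        """
    )
]
print(ids)
```

Other databases expose the same idea through LEAST and GREATEST.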

How to concatenate corresponding row values to make column names in pandas?

I have the messy dataframe below. I need to combine rows 0 and 1 to form the column names and keep the remaining rows as they are:
Start Date 2005-01-01 Unnamed: 3 Unnamed: 4 Unnamed: 5
Dat an_1 an_2 an_3 an_4 an_5
mt mt s t inch km
23 45 67 78 89 9000
Change it to the dataframe below:
Dat_mt an_1_mt an_2_s an_3_t an_4_inch an_5_km
23 45 67 78 89 9000
IIUC
df.columns=df.loc[0]+'_'+df.loc[1]
df=df.loc[2:]
df
Out[429]:
Dat_mt an_1_mt an_2_s an_3_t an_4_inch an_5_km
2 23 45 67 78 89 9000

SQL query between and equals

There are three tables. The first, baseline, contains all beneficiaries' information and has a PPI Score column. The second, PPI_SCORE_TOOKUP, contains the six columns shown below. The third, endline, contains the beneficiaries' endline assessment data and also has a PPI_Score column.
I want to join these tables, but PPI_SCORE_TOOKUP has no foreign key to baseline or endline; the only link is the PPI score itself, which appears in baseline and endline and falls into a range in PPI_SCORE_TOOKUP.
The query should show some baseline data along with the PPI result where the PPI score in the baseline table is between or equal to PPI_SCORE_START and PPI_SCORE_END. In the same row, it should also show the endline data of the same member, along with the six PPI columns where the PPI score in the endline table is between or equal to PPI_SCORE_START and PPI_SCORE_END.
Note: I have not tried any query yet since I have no idea how to do this; the result I expect is shown at the bottom of this question.
Tables are as follows
baseline table
ID NAME LAST_NAME DISTRICT PPI_SCORE
1 A A A 10
2 B B B 23
3 C C C 90
4 D D D 47
endline table
baseline_ID Enterprise Market PPI_SCORE
3 Bee Keeping Yes
2 Poultry No 74
1 Agriculture Yes 80
PPI_SCORE_TOOKUP table
ppi_start ppi_end national national_150 national_200 usaid
0 4 100 100 100 100
10 14 66.1 89.5 96.5 39.2
5 9 68.8 90.2 96.7 44.4
15 19 59.5 89.1 97.2 35.2
20 24 51.3 85.5 96.4 28.8
25 29 43.5 81.1 93.2 20
30 34 31.9 74.5 90.4 13.6
35 39 24.6 66.9 87.3 7.9
40 44 15.2 58 82.8 4.5
45 49 11.4 47.9 73.4 4.2
50 54 6 37.2 68.4 2.6
55 59 2.7 26.1 61.3 0.5
60 64 0.9 21 50.4 0.5
65 69 0 14.3 37.1 0
70 74 3 14.3 29.2 0
75 79 0 1.4 5.1 0
80 84 0 0 9.5 0
85 89 0 0 15.2 0
90 94 0 0 0 0
95 100 0 0 0 0
Expected Result
Your query can be made in the following way:
SELECT *
FROM baseline b
LEFT JOIN endline e ON b.id = e.baseline_ID
LEFT JOIN PPI_SCORE_TOOKUP ppi ON b.PPI_SCORE BETWEEN ppi.ppi_start AND ppi.ppi_end
LEFT JOIN PPI_SCORE_TOOKUP ppi2 ON e.PPI_SCORE BETWEEN ppi2.ppi_start AND ppi2.ppi_end
This matches the IDs from the baseline table with the baseline_IDs from the endline table, keeping baseline rows that have no endline match (their endline columns are null). It then matches the PPI_SCORE from baseline against ppi_start and ppi_end from PPI_SCORE_TOOKUP. Finally, it matches the PPI_SCORE from endline against ppi_start and ppi_end in a second copy of the lookup table.
Replace * with whichever fields you want to return.
See fiddle for a working example
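As a rough check of the range join, the sketch below runs the same query shape from Python against an in-memory SQLite database. It loads the baseline and endline rows from the question and only the lookup bands and the national column needed for them; the missing endline score for baseline_ID 3 is assumed to be NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE baseline (ID INTEGER, NAME TEXT, PPI_SCORE INTEGER);
    CREATE TABLE endline (baseline_ID INTEGER, Enterprise TEXT, PPI_SCORE INTEGER);
    CREATE TABLE PPI_SCORE_TOOKUP (ppi_start INTEGER, ppi_end INTEGER, national REAL);
    """
)
conn.executemany("INSERT INTO baseline VALUES (?, ?, ?)",
                 [(1, "A", 10), (2, "B", 23), (3, "C", 90), (4, "D", 47)])
# Assumption: the blank score for baseline_ID 3 in the question is NULL.
conn.executemany("INSERT INTO endline VALUES (?, ?, ?)",
                 [(3, "Bee Keeping", None), (2, "Poultry", 74), (1, "Agriculture", 80)])
# Subset of the lookup bands from the question (national column only).
conn.executemany("INSERT INTO PPI_SCORE_TOOKUP VALUES (?, ?, ?)",
                 [(10, 14, 66.1), (20, 24, 51.3), (45, 49, 11.4),
                  (70, 74, 3), (80, 84, 0), (90, 94, 0)])

rows = conn.execute(
    """
    SELECT b.ID, b.PPI_SCORE, ppi.national, e.PPI_SCORE, ppi2.national
    FROM baseline b
    LEFT JOIN endline e ON b.ID = e.baseline_ID
    LEFT JOIN PPI_SCORE_TOOKUP ppi ON b.PPI_SCORE BETWEEN ppi.ppi_start AND ppi.ppi_end
    LEFT JOIN PPI_SCORE_TOOKUP ppi2 ON e.PPI_SCORE BETWEEN ppi2.ppi_start AND ppi2.ppi_end
    ORDER BY b.ID
    """
).fetchall()
for row in rows:
    print(row)
```

Because every join is a LEFT JOIN, member 4 (no endline row) and member 3 (no endline score) still appear, with NULLs in the endline-side columns.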

Oracle - Group By Creating Duplicate Rows

I have a query that looks like this:
select nvl(trim(a.code), 'Blanks') as Ward, count(b.apcasekey) as UNSP, count(c.apcasekey) as GRAPH,
count(d.apcasekey) as "ANI/PIG",
(count(b.apcasekey) + count(c.apcasekey) + count(d.apcasekey)) as "TOTAL ACTIVE",
count(a.apcasekey) as "TOTAL OPEN" from (etc...)
group by a.code
order by Ward
The reason I have nvl(trim(a.code), 'Blanks') as Ward is that sometimes a.code is a blank string, sometimes it's a null.
The problem is that when I use the Group By statement, I can't use Ward or I get the error
Ward: Invalid Identifier
I can only use a.code so I get 2 rows for 'Blanks', as per below
1 Blanks 7 0 0 7 7
2 Blanks 23 1 1 25 30
3 W01 75 4 0 79 91
4 W02 62 1 0 63 72
5 W03 140 2 0 142 162
6 W04 6 1 0 7 7
7 W05 46 0 1 47 48
8 W06 322 46 1 369 425
9 W07 91 0 1 92 108
10 W08 93 2 0 95 104
11 W09 28 1 0 29 30
12 W10 25 0 0 25 28
What I need is for the rows with 'Blanks' to be combined into 1 row. Little help?
Thanks.
You cannot use the alias in the GROUP BY, but you can use the expression that builds the value:
GROUP BY nvl(trim(a.code), 'Blanks')
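The effect of grouping by the expression instead of the raw column can be sketched from Python with SQLite. This is an analogue, not Oracle: SQLite has no nvl and does not treat an empty string as NULL, so the sketch uses COALESCE(NULLIF(TRIM(code), ''), 'Blanks') to play the role of nvl(trim(a.code), 'Blanks'). The table name cases and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (code TEXT)")  # hypothetical table for illustration
# A NULL, a blank string, an empty string and two real ward codes.
conn.executemany("INSERT INTO cases VALUES (?)",
                 [(None,), (" ",), ("",), ("W01",), ("W01",)])

# Grouping by the raw column keeps NULL, ' ' and '' as separate groups.
raw_groups = conn.execute("SELECT COUNT(*) FROM cases GROUP BY code").fetchall()

# Grouping by the normalising expression merges them into a single 'Blanks' row.
merged = conn.execute(
    """
    SELECT COALESCE(NULLIF(TRIM(code), ''), 'Blanks') AS Ward, COUNT(*) AS n
    FROM cases
    GROUP BY COALESCE(NULLIF(TRIM(code), ''), 'Blanks')
    ORDER BY Ward
    """
).fetchall()
print(len(raw_groups), merged)
```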