Ordering a list in SQL or Excel - sql

I have a simple task to do at work almost daily and I think it can be done easily with some help. There is a table with a single column "PC name". I have to divide the list of PC's into waves.
Wave 1 : 2%
wave 2: 3%
wave 3: 25%
wave 4: 45%
wave 5: 25%
So what I usually do is copy the list of PCs into Excel and add a column named "wave assign". For example, if the list is 100 PCs, the first two PCs will be assigned to Wave 1, three PCs to Wave 2, 25 PCs to Wave 3, and so on.
I need a way to automate this since it takes me too long to do it manually. It doesn't matter if there is a small change in the % in order to round up the number of PCs in each wave.

Assuming the list is in ColumnA starting in Row1:
=VLOOKUP(ROWS(A$1:A1)/COUNTA(A:A),wArray,2)
in Row1 and copied down should work, provided a lookup array of the following kind is created and named wArray:

0.000	1
0.022	2
0.052	3
0.302	4
0.752	5
In case the list is shorter than 100, I have added .002 to the 'logical' breakpoints (cumulative proportions) so that the minority waves are not the ones rounded down. Otherwise, at say 50 items, Wave 1 would not feature at all (and hence stand out rather more than an approximation would in a larger group).
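The same assignment can be sketched in plain Python. This is a minimal illustration (not the asker's Excel sheet) using the wave percentages from the question and the +0.002 nudge described above; `assign_waves` is a hypothetical name:

```python
def assign_waves(pcs):
    # Cumulative breakpoints where waves 1-5 (2%, 3%, 25%, 45%, 25%) begin,
    # each nudged up by 0.002 as described above so short lists still get a Wave 1.
    breakpoints = [0.000, 0.022, 0.052, 0.302, 0.752]
    n = len(pcs)
    out = []
    for i, pc in enumerate(pcs, start=1):
        frac = i / n  # mirrors ROWS(A$1:A1)/COUNTA(A:A)
        # Like VLOOKUP's approximate match: the wave of the largest breakpoint <= frac.
        wave = max(w for w, bp in enumerate(breakpoints, start=1) if frac >= bp)
        out.append((pc, wave))
    return out
```

With a 100-PC list this reproduces the 2/3/25/45/25 split from the question.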

Related

Calculate variable based on values of another variable

I'm learning SPSS for a research methods class but I'm a bit confused on how to enter and define values that represent a meaning after a certain point. For example, the problem I am working on states:
Suppose the following indexed scores represent performance on a new survey meant to understand an individual’s level of depression. Suppose a score of above 20 represents a depressed individual based on the survey design.
Scores: 13.5, 15.7, 14.3, 16.7, 21.2, 20.7, 22.3, 17.4, 16.8, and 12.4
What is the relative frequency of those individuals that represent depressed individuals?
How would one define or make the values over 20 marked as "depressed" to accurately calculate the relative frequency?
Please and thank you!
You need to calculate a new variable based on the existing one which will have a value of 1 if the original variable is over the threshold, and 0 if not. There are a few ways to do that, this is the simplest one:
compute depressed=(score>20).
You can now add labels and analyse:
value labels depressed
1 "depressed: score over 20"
0 "score not over 20".
frequencies depressed.
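The same calculation can be checked in plain Python (not SPSS) using the scores from the question: flag each score over 20, then take the mean of the flags as the relative frequency.

```python
# Scores from the question; a flag of 1 marks "depressed" (score over 20),
# mirroring: compute depressed=(score>20).
scores = [13.5, 15.7, 14.3, 16.7, 21.2, 20.7, 22.3, 17.4, 16.8, 12.4]
depressed = [1 if s > 20 else 0 for s in scores]

# Relative frequency = count of flagged cases / total cases.
rel_freq = sum(depressed) / len(depressed)
print(rel_freq)  # 3 of the 10 scores exceed 20, so 0.3
```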

Multi-dimensional dataframe or multiple 2D dataframes

A colleague wrote some code to create a price lookup table for products where the prices change throughout the year. He also stores other information like the name of the season, when it starts, ends, etc. His code takes nine minutes to run on a beefy machine.
His approach is the traditional SQL loop-over-records algorithm. I wanted to see if I could do better using matrices, so I wrote a price table (of only prices) using Pandas. My code runs in 21 seconds on a Macbook Air. Cool.
My next step is to add in other attributes like name of the season, when it starts, ends, etc. It's my understanding that I shouldn't store objects in my dataframes because that will reduce speed, is Bad Practice, etc.
I think I have two options:
1. For each new piece of data, add another dimension, so the shape of my dataframe would go from (product x days) to (product x days x season_name x season_start x season_end); or
2. Create a new dataframe for each attribute and jump back and forth between them as necessary.
My goal is to use pandas for very quick lookups and calculations of data.
Or is there a better more pandas-ish way to do this?
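One common pandas pattern along the lines of option 2 is to keep the fast numeric table and the descriptive attributes in separate frames, linked by a key, and only merge when the attributes are actually needed. A hypothetical sketch (all column names and data invented for illustration):

```python
import pandas as pd

# Compact numeric table used for fast lookups; season details live elsewhere,
# referenced by a small integer key instead of stored objects.
prices = pd.DataFrame({
    "product": ["A", "A", "B", "B"],
    "day": pd.to_datetime(["2024-01-01", "2024-07-01", "2024-01-01", "2024-07-01"]),
    "season_id": [1, 2, 1, 2],
    "price": [10.0, 12.5, 8.0, 9.5],
})

# One row per season: name, start, end.
seasons = pd.DataFrame({
    "season_id": [1, 2],
    "season_name": ["winter", "summer"],
    "season_start": pd.to_datetime(["2023-12-01", "2024-06-01"]),
    "season_end": pd.to_datetime(["2024-02-29", "2024-08-31"]),
})

# Attributes are pulled in only when needed, via a join on the key.
with_names = prices.merge(seasons, on="season_id", how="left")
```

This keeps the hot path (lookups on `prices`) purely numeric while still making the season metadata available on demand.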

Import data from csv into database when not all columns are guaranteed

I am trying to build an automatic feature for a database that takes NOAA weather data and imports it into our own database tables.
Currently we have 3 steps:
1. Import the data literally into its own table to preserve the original data
2. Copy its data into a table that better represents our own data in structure
3. Then convert that table into our own data
The problem I am having stems from the data that NOAA gives us. It comes in the following format:
Station Station_Name Elevation Latitude Longitude Date MXPN Measurement_Flag Quality_Flag Source_Flag Time_Of_Observation ...
Starting with MXPN (maximum temperature for water in a pan), which for example is comprised of its own column and the 4 columns after it, the same 5 columns repeat for each form of weather observation. The problem, though, is that if a particular type of weather was not observed at any of the stations reported, that set of 5 columns will be completely omitted.
For example if you look at Central Florida stations, you will find no SNOW (Snowfall measured in mm). However, if you look at stations in New Jersey, you will find this column as they report snowfall. This means a 1:1 mapping of columns is not possible between different reports, and the order of columns may not be guaranteed.
Even worse, some of the weather types include wild cards in their definition, e.g. SN*# where * is a number from 0-8 representing the type of ground, and # is a number 1-7 representing the depth at which soil temperature was taken for the minimum soil temperature, and we'd like to collect these together.
All of these are column headers, and my instinct is to build a small Java program to map these properly to our data set as we'd like it. However, my superior believes it may be possible to have the database do this on a mass import, but he does not know how to do it.
Is there a way to do this as a mass import, or is it best for me to just write the Java program to convert the data to our format?
Systems in use:
MariaDB for the database.
Centos7 for the operating system (if it really becomes an issue)
Java is being done with JPA and Spring Boot, with hibernate where necessary.
You are creating a new table for each file.
I presume that the first 6 fields are always present, and that you have 0 or more occurrences of the next 5 fields. If you were using SQL Server, I would approach it as follows:
1. Query the information_schema catalog to get a count of the fields in the table. If the count = 6 then no observations are present; if 11 columns, then you have 1 observation; if 16, then you have 2 observations, etc.
2. Now that you know the number of observations, you can write some SQL that will loop over the observations and insert them into a child table with a link back to a parent table which holds the first 6 fields.
apologies if my assumptions are way off.
-HTH
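Since the columns are named in the header, another approach (in the spirit of the small conversion program the asker mentions, sketched here in Python rather than Java) is to locate each observation group by its header name instead of by position, so an omitted group like SNOW is simply never looked for. All function and column names below are illustrative assumptions based on the format described in the question:

```python
import csv
import io

# The six fixed columns described in the question, then repeating 5-column
# observation groups (value column + 4 flag columns).
FIXED = ["Station", "Station_Name", "Elevation", "Latitude", "Longitude", "Date"]
FLAG_COLS = {"Measurement_Flag", "Quality_Flag", "Source_Flag", "Time_Of_Observation"}

def parse_report(text):
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # Any column past the fixed six that is not a flag column names an
    # observation type (MXPN, SNOW, ...); absent types simply never appear.
    obs_cols = [(i, h) for i, h in enumerate(header)
                if i >= len(FIXED) and h not in FLAG_COLS]
    rows = []
    for rec in reader:
        row = dict(zip(FIXED, rec[:len(FIXED)]))
        for i, name in obs_cols:
            row[name] = rec[i]  # flag columns sit at i+1 .. i+4 if needed
        rows.append(row)
    return rows
```

Each parsed row is then a dictionary keyed by observation name, which maps cleanly onto a parent/child insert regardless of which groups a given file contains.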

search Big data table

I have a table with 10 million records. Each record represents one person and has person_id, latitude, longitude, and postal code. Given one person, I want to find how many other people are within a 10-mile radius (distance can be calculated from latitude and longitude). Scanning all 10 million records and calculating each distance to check whether it is inside 10 miles is not a good way. So I will search only in neighboring postal codes (which I will get somehow). How can I search for entries having a specific postal code (rather than all 10 million records)?
Why not take lat/long and create a box extending 10 miles in all four directions first?
Then issue a query looking for people with lat/long in that box. Use a WHERE that does
x > xLess10 and x < xPlus10 and y > yLess10 and y < yPlus10
Now you have a smaller list and you can calculate the actual distance with something similar to sqrt((x1 - x2)^2 + (y1 - y2)^2) for that smaller list. But it has to work on a sphere, not a grid marked off in miles.
You can try adding an `and zip in (555555, 555556, etc)` to see if that runs faster or not. A precomputed list of all other zip codes with locations within 10 miles of anywhere within a zip code would be pretty easy to set up in another table.
@Randy made a comment that made me realize this doesn't work very well for locations within 10 miles of the north and south poles. Maybe that doesn't matter, because the population is pretty small up there. Or use another method of just getting everyone within a circle around the pole and 10 miles south (or north) of the x,y location.
Also, you have to find a way to convert from lat/long to miles. The longitudinal lines get closer together the farther you are from the equator.
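The box-then-filter idea above can be sketched in Python. This is a simplified illustration (function names are my own): latitude degrees are roughly 69 miles apart everywhere, while longitude degrees shrink by cos(latitude) away from the equator, and the exact distance uses the haversine great-circle formula rather than the flat-grid `sqrt` approximation:

```python
import math

def box_bounds(lat, lon, miles=10.0):
    # Bounding box extending `miles` in all four directions; ~69 miles per
    # degree of latitude, scaled by cos(latitude) for longitude.
    dlat = miles / 69.0
    dlon = miles / (69.0 * math.cos(math.radians(lat)))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance on a sphere of radius ~3959 miles.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 3959.0 * math.asin(math.sqrt(a))

def neighbors_within(people, lat, lon, miles=10.0):
    # people: iterable of (person_id, lat, lon).
    # Cheap box filter first, exact spherical distance on the survivors.
    lo_lat, hi_lat, lo_lon, hi_lon = box_bounds(lat, lon, miles)
    return [pid for pid, plat, plon in people
            if lo_lat <= plat <= hi_lat and lo_lon <= plon <= hi_lon
            and haversine_miles(lat, lon, plat, plon) <= miles]
```

In a database the box filter becomes the indexed `WHERE` clause from the answer, and only the small surviving set pays for the trigonometry.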

Optimization – Rearranging items of a specific type?

Here’s a little optimization problem for a personal project I’m working on: so imagine you have many boxes of an item, and within each box, you know that the items must be shipped to various locations.
Example:
Box 1: 1000 items, 500 to location A, 500 to location B.
Box 2: 1000 items, 500 to location A, 500 to location B.
I want to be able to rearrange the items such that I can ship as many full boxes as I can to one destination. For instance, in the example above, I would be able to rearrange the items such that new_box 1 has 1000 items to location A, and new_box 2 has 1000 items to location B.
Now you can imagine that this perfect rearrangement case will not always occur. What if I had:
Box 1: 1000 items, 500 to location A, 300 to location B, 200 to location C.
Box 2: 1000 items, 500 to location A, 500 to location B.
Then, I would want to (a) maximize the number of full boxes to one destination, and (b) minimize the number of different locations in every other box. For instance, having 3 boxes each with 2 different destinations would be better than having 3 boxes each with 3 different destinations. The optimum rearrangement of the second example above would be:
New_box_1: 1000 items, 1000 to location A.
New_box_2: 1000 items, 800 to location B and 200 to location C.
My question is: How would I handle this situation for arbitrary number of boxes and arbitrary number of destinations per box? For the sake of the problem, let’s start with assuming that each box has the same number of items.
What I’m thinking right now is to take a greedy approach:
1. Go down the line for each subsequent box and keep a running sum, per destination, of the number of items that will be shipped to it.
2. If any of these sums reaches a value greater than the box capacity, place those items in their own box.
3. Then take the highest number of items to one destination that is remaining; call this "x". Take the value (box capacity – x); call this "y". Find the number of items to one destination that is both ABOVE "y" and the closest to "y".
4. Then place "x" items to the first destination and "y" items to the second destination in another box, and repeat.
Any other suggestions, or insights? Thanks so much.
I think this is essentially a 1D bin-packing problem: each box is a bin with a capacity of 1000 items, and each destination's item total is an item to pack. Then you can use a bin-packing heuristic like best-fit, first-fit, or similar.
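The bin-packing view can be sketched with a first-fit-decreasing heuristic in Python. This is a minimal illustration (function name and structure are my own, not from the thread): demands larger than a box are peeled off into full single-destination boxes first, then the remainders are packed largest-first:

```python
def pack_boxes(demands, capacity=1000):
    # demands: dict mapping destination -> total items for that destination.
    items, boxes = [], []
    for dest, qty in demands.items():
        # Peel off full single-destination boxes (goal (a) in the question).
        while qty >= capacity:
            boxes.append({dest: capacity})
            qty -= capacity
        if qty:
            items.append((dest, qty))
    # First-fit decreasing over the remainders: largest quantities placed
    # first, each into the first box with room, else a new box.
    for dest, qty in sorted(items, key=lambda t: -t[1]):
        for box in boxes:
            if sum(box.values()) + qty <= capacity:
                box[dest] = box.get(dest, 0) + qty
                break
        else:
            boxes.append({dest: qty})
    return boxes
```

On the second example from the question (A: 1000, B: 800, C: 200 after totaling across boxes), this yields one full box to A and one box split 800/200 between B and C, matching the stated optimum. First-fit decreasing is a heuristic, though, so it will not minimize the destinations-per-box count in every case.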