Can I assign column names to a SQL query with dynamic columns?

I'm a noob at SQL, so sorry if my title isn't correct. Here's the problem:
I have a QODBC [1] query for QuickBooks that generates Profit & Loss report with dynamic columns:
sp_report
ProfitAndLossStandard
show
"Text",
"RowType",
"Amount" <-- dynamic
parameters
DateFrom= ?,
DateTo= ?,
SummarizeColumnsBy='Day'
With the parameter SummarizeColumnsBy='Day' you get x number of Amount columns for x number of days between DateFrom & DateTo (inclusive).
The above query with three days range looks something like this (first three rows):
   0                        1        2    3        4        5
0  Ordinary Income/Expense  TextRow  366  None     None     None
1  Income                   TextRow  366  None     None     None
2  4000 · Revenue           DataRow  366  [value]  [value]  [value]
...
Columns 3-5 are the Amount feature and they each show a summary value for a date in the range (only on row type == DataRow). (I also get this extra column, 2, not in the SQL request.)
I'm using Python and pyodbc to call QuickBooks. The resulting data is put into a pandas DataFrame to represent the P&L in my script.
I can do some DataFrame carpentry to get the column names with Pandas, but is there a way to get the column names in the SQL syntax?
I.e. 0 = "Text", 1 = "RowType", ...
[1]: In case you don't know, "QODBC is a fully functional ODBC driver for reading and writing QuickBooks ...accounting data files by using standard SQL commands."
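On the pandas side, the renaming carpentry can stay small, since the dynamic Amount columns map one-to-one to the dates in the requested range. A minimal sketch with made-up dates and a single fake row (the "Extra" name for the unexplained column 2 is hypothetical):

```python
import pandas as pd

# Hypothetical three-day range; in practice these are the DateFrom/DateTo
# parameters bound into the QODBC query
date_from, date_to = "2023-01-01", "2023-01-03"
dates = pd.date_range(date_from, date_to).strftime("%Y-%m-%d").tolist()

# Fake result row standing in for the pyodbc fetch
rows = [["Income", "TextRow", 366, None, None, None]]
df = pd.DataFrame(rows)

# Fixed columns first, then one name per day in the range
df.columns = ["Text", "RowType", "Extra"] + dates
```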


Finding the mean of a column; but excluding a singular value

Imagine I have a dataset that is like so:
         ID  birthyear     weight
0    619040       1962  0.1231231
1    600161       1963   0.981742
2  25602033       1963  1.3123124
3    624870       1987     10,000
and I want to get the mean of the column weight, but the obvious outlier 10,000 is hindering the actual mean. In this situation I cannot change the value but must work around it. This is what I've got so far, but obviously it's including that last value.
avg_num_items = df_cleaned['trans_quantity'].mean()
translist = df_cleaned['trans_quantity'].tolist()
My dataframe is df_cleaned and the column I'm actually working with is 'trans_quantity', so how do I go about the mean while working around that value?
Since you added SQL to your tags: in SQL you'd want to exclude it in the WHERE clause:
SELECT AVG(trans_quantity)
FROM your_data_base
WHERE trans_quantity <> 10000
In Pandas:
avg_num_items = df_cleaned[df_cleaned["trans_quantity"] != 10000]["trans_quantity"].mean()
You can also replace your value with a NAN and skip it in the mean:
avg_num_items = df_cleaned["trans_quantity"].replace(10000, np.nan).mean(skipna=True)
With pandas, ensure you have numeric data (10,000 is parsed as a string), filter out the values at or above the threshold, and take the mean:
(pd.to_numeric(df['weight'], errors='coerce')
.loc[lambda x: x<10000]
.mean()
)
output: 0.8057258333333334

Pandas pivot multiple columns, indexes, and values

I have the following dataset
   resulted_by  follow_up_result  follow_up_number   #     %
0  User 1       good              1                  30    30
1  User 2       good              2                  65    65
2  User 3       bad               3                  5     0.05
I want to Pivot:
follow up result and resulted by as indexes
follow up number as a column
# and % as values
pivot = df.head(3).pivot(columns=['follow_up_number'], values=["#", '%'], index=['follow_up_result', 'resulted_by'])
However, I want the follow up number to be above the values, here is how I achieved that:
pivot = df.head(3).pivot(columns=['follow_up_result', 'resulted_by'], values=["#", '%'], index=['follow_up_number'])
pivot = pivot.stack(level=0).T
Notice how I switch columns and indexes.
I want the column names to be at the same level as the values.
Is there a way to do that?
Is there a better way to achieve what I need without switching between columns and indexes?
Code Snippet:
https://onecompiler.com/python/3y5gzm7hu
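One possible alternative to the transpose trick, sketched with the sample rows above: keep the first pivot and just swap the column levels so follow_up_number ends up on top (this needs a pandas version that accepts lists for index/values in pivot, i.e. 1.1+):

```python
import pandas as pd

df = pd.DataFrame({
    "resulted_by": ["User 1", "User 2", "User 3"],
    "follow_up_result": ["good", "good", "bad"],
    "follow_up_number": [1, 2, 3],
    "#": [30, 65, 5],
    "%": [30, 65, 0.05],
})

pivot = df.pivot(index=["follow_up_result", "resulted_by"],
                 columns="follow_up_number", values=["#", "%"])
# Move follow_up_number above the value names, then sort for a tidy layout
pivot = pivot.swaplevel(0, 1, axis=1).sort_index(axis=1)
```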

Spark Dataframe : Group by custom Range

I have a Spark DataFrame which I aggregated based on a column called "Rank", with "Sum" being the sum of all values with that rank.
df.groupBy("Rank").agg(sum("Col1")).orderBy("Rank").show()
Rank  Sum(Col1)
1     1523
2     785
3     232
4     69
5     126
...   ...
430   7
Instead of having the Sum for every single value of "Rank", I would like to group my data into rank "buckets", to get a more compact output. For example :
Rank Range  Sum(Col1)
1           1523
2-5         1212
5-10        ...
...         ...
100+        ...
Instead of having 4 different rows for Rank 2,3,4,5 - I would like to have one row "2-5" showing the sum for all these ranks.
What would be the best way of doing that? I am quite new to Spark DataFrames and am thankful for any help, especially examples of how to achieve that.
Thank you !
A few options:
Histogram - build a histogram. See the following post:
Making histogram with Spark DataFrame column
Add another column for the bucket values (See Apache spark dealing with case statements):
df.select(when(df("Rank") === 1, "1")
  .when(..., "2-5")
  .otherwise("100+")
  .alias("Rank Range"))
Now you can run your group by query on the new Rank Range column.
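For illustration, here is the same bucketing idea sketched in pandas with pd.cut, using the ranks 1-5 from the example table (on the Spark side, the when/otherwise chain above plays the role of these bins):

```python
import pandas as pd

ranks = pd.DataFrame({"Rank": [1, 2, 3, 4, 5],
                      "Sum(Col1)": [1523, 785, 232, 69, 126]})

# Right-closed bins: (0,1] -> "1", (1,5] -> "2-5", (5,10] -> "6-10", (10,inf) -> "11+"
buckets = pd.cut(ranks["Rank"], bins=[0, 1, 5, 10, float("inf")],
                 labels=["1", "2-5", "6-10", "11+"])
summed = ranks.groupby(buckets, observed=True)["Sum(Col1)"].sum()
```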

Why can't I access df data in pandas?

I have a table where the column names are not really organized; different years of data sit in a varying number of columns.
So I should access each data through specified column names.
I am using this syntax to access a column.
df = df[["2018/12"]]
But when I just want to extract numbers under that column, using
df.iloc[0,0]
it throws an error like
single positional indexer is out-of-bounds
So I am using
df.loc[0]
but it has the column name with the numeric data.
How can I extract just the number of each row?
Below is the CSV data
Closing Date,2014/12,2015/12,2016/12,2017/12,2018/12,Trend
Net Sales,"31,634","49,924","62,051","68,137","72,590",
""
Net increase,"-17,909","-16,962","-34,714","-26,220","-29,721",
Net Received,-,-,-,-,-,
Net Paid,-328,"-6,038","-9,499","-9,375","-10,661",
Assuming you have the following data frame df imported from your csv:
Closing Date 2014/12 2015/12 2016/12 2017/12 2018/12
0 Net Sales 31,634 49,924 62,051 68,137 72,590
1 Net increase -17,909 -16,962 -34,714 -26,220 -29,721
2 Net Received - - - - -
3 Net Paid -328 -6,038 -9,499 -9,375 -10,661
then by doing df = df[["2018/12"]] you create a new data frame with one column, and df.iloc[0,0] will work perfectly well here, returning 72,590. If you wrote df = df["2018/12"] you'd create a new series, and there df.iloc[0,0] would throw the error 'too many indexers', because it's a one-dimensional series.
Anyway, if you need the values of a series, use the values attribute (or to_numpy() for version 0.24 or later) to get the data as an array, or tolist() to get them as a list.
But I guess what you really want is to have your table transposed:
df = df.set_index('Closing Date').T
to the following more logical form:
Closing Date Net Sales Net increase Net Received Net Paid
2014/12 31,634 -17,909 - -328
2015/12 49,924 -16,962 - -6,038
2016/12 62,051 -34,714 - -9,499
2017/12 68,137 -26,220 - -9,375
2018/12 72,590 -29,721 - -10,661
Here, df.loc['2018/12','Net Sales'] gives you 72,590 etc.
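One more wrinkle: the cell values above are still strings like 72,590. If you need arithmetic, read_csv's thousands parameter can parse them as numbers up front; a sketch with a trimmed copy of the data (the all-dash rows are left out here):

```python
import pandas as pd
from io import StringIO

csv = '''Closing Date,2014/12,2015/12,2016/12,2017/12,2018/12
Net Sales,"31,634","49,924","62,051","68,137","72,590"
Net Paid,-328,"-6,038","-9,499","-9,375","-10,661"'''

# thousands="," strips the separators, so the columns come out numeric
df = pd.read_csv(StringIO(csv), thousands=",").set_index("Closing Date").T
print(df.loc["2018/12", "Net Sales"])
```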

Pandas: Date difference loop between columns with similar names (ACD and ECD)

I'm working in Jupyter and have a large number of columns, many of them dates. I want to create a loop that will return a new column with the date difference between two similarly-named columns.
For example:
df['Site Visit ACD']
df['Site Visit ECD']
df['Sold ACD (Loc A)']
df['Sold ECD (Loc A)']
The new column will have a column df['Site Visit Cycle Time'] = date difference between ACD and ECD. Generally, it will always be the column that contains "ACD" minus the column that contains "ECD". How can I write this?
Any help appreciated!
The following code will:
find columns whose names are similar (fuzz ratio over 90, using the fuzzywuzzy package)
perform the date (or time) difference
avoid performing the same computation twice for a pair
name the new column 'Site Visit' if the source column is called more or less like that
otherwise name it 'difference between <column 1> and <column 2>'
I hope it helps.
import pandas as pd
from fuzzywuzzy import fuzz

name = pd.read_excel('Book1.xlsx', sheet_name='name')
unique = []
for i in name.columns:
    for j in name.columns:
        # Only handle each pair once, in one direction
        if i != j and fuzz.ratio(i, j) > 90 and i + j not in unique:
            if 'Site Visit' in i:
                name['Site Visit'] = name[i] - name[j]
            else:
                name['difference between ' + i + ' and ' + j] = name[i] - name[j]
            unique.append(j + i)
            unique.append(i + j)
print(name)
Generally, it will always be the column that contains "ACD" minus the column that contains "ECD".
This answer assumes the column titles are not noisy, i.e. they only differ in "ACD" / "ECD" and are exactly the same apart from that (upper/lower case included). Also assuming that there always is a matching column. This code doesn't check if it overwrites the column it writes the date difference to.
This approach works in linear time, as we iterate over the set of columns once and directly access the matching column by name.
test.csv
Site Visit ECD,Site Visit ACD,Sold ECD (Loc A),Sold ACD (Loc A)
2018-06-01,2018-06-04,2018-07-05,2018-07-06
2017-02-22,2017-03-02,2017-02-27,2017-03-02
Code
import pandas as pd

df = pd.read_csv("test.csv", delimiter=",")
for col_name_acd in df.columns:
    # Skip columns that don't have "ACD" in their name
    if "ACD" not in col_name_acd:
        continue
    col_name_ecd = col_name_acd.replace("ACD", "ECD")
    # we assume there is always a matching "ECD" column
    assert col_name_ecd in df.columns
    col_name_diff = col_name_acd.replace("ACD", "Cycle Time")
    df[col_name_diff] = df[col_name_acd].astype('datetime64[ns]') - df[col_name_ecd].astype('datetime64[ns]')
print(df.head())
Output
Site Visit ECD Site Visit ACD Sold ECD (Loc A) Sold ACD (Loc A) \
0 2018-06-01 2018-06-04 2018-07-05 2018-07-06
1 2017-02-22 2017-03-02 2017-02-27 2017-03-02
Site Visit Cycle Time Sold Cycle Time (Loc A)
0 3 days 1 days
1 8 days 3 days