Pig: how to loop through all fields/columns? - apache-pig

I'm new to Pig. I need to do some calculation over all fields/columns in a table, but I can't find a way to do it by searching online. It would be great if someone here could help!
For example: I have a table with 100 fields/columns, most of them numeric. I need to find the average of each column. Is there an elegant way to do it without repeating AVG(column_xxx) 100 times?
If there's just one or two columns, then I can do
B = GROUP A ALL;
C = FOREACH B GENERATE AVG(A.column_1), AVG(A.column_2);
However, with 100 fields it's really tedious to write AVG 100 times, and it's easy to make errors.
One way I can think of is to embed Pig in Python, use Python to generate the statement as a string, and pass it to compile. However, that still sounds awkward even if it works.
Thank you in advance for help!

I don't think there is a nice way to do this in Pig. However, this should work well enough and can be done in 5 minutes:
DESCRIBE the table (or alias) in question
Copy the output and reorganize it manually into the script part you need (for example with Excel)
Finish and store the script
If you need to cope with columns that can suddenly change etc., there is probably no good way to do it in Pig. Perhaps you could read in all the columns (in R, for example) and do your operation there.
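The "generate the script with Python" idea the asker mentioned can be sketched like this: build the repetitive FOREACH ... GENERATE line from a list of column names (for example, pasted from DESCRIBE output). The column names here are hypothetical placeholders.

```python
# Sketch: generate the repetitive Pig statement from a list of column
# names. The names below are made up; in practice you would paste them
# from the DESCRIBE output of your alias.
columns = ["column_%d" % i for i in range(1, 101)]

# One AVG(...) expression per column, joined into a single GENERATE line.
avg_exprs = ", ".join("AVG(A.%s)" % c for c in columns)
script = "B = GROUP A ALL;\nC = FOREACH B GENERATE %s;" % avg_exprs
print(script)
```

The printed string can then be saved as a .pig script or passed to an embedded-Pig compile call, so the 100 repetitions are generated rather than typed.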

Related

Organising csv. file data in Python

I am quite a beginner with Python, but I have a programming-related project to work on, so I would really like to ask for some help. I didn't find many simple solutions for organizing the data in such a way that I could do some analysis with it.
First, I have multiple csv files, which I read in as DataFrame objects. In the end, I need to analyze them all together (right now the files are separated into a list of DataFrames, but later on I will probably need them as one DataFrame object).
However, I have a problem with organizing and separating the data. There are thousands of rows in one column; a part of it is shown below:
CIP;Date;Hour;Cons;REAL/ESTIMATED
EN025140855608477018TC2L;11/03/2020;1;0 057;R
EN025140855608477018TC2L;11/03/2020;2;0 078;R
EN025140855608477018TC2L;11/03/2020;3;0 033;R
EN025140855608477018TC2L;11/03/2020;4;0 085;R
EN025140855608477018TC2L;11/03/2020;5;0 019;R
...
EN025140855608477018TC2L;11/04/2020;20;0 786;R
EN025140855608477018TC2L;11/04/2020;21;0 288;R
EN025140855608477018TC2L;11/04/2020;22;0 198;R
EN025140855608477018TC2L;11/04/2020;23;0 728;R
EN025140855608477018TC2L;11/04/2020;24;0 275;R
In the area with the large space, the number should be merged together, for example 0.057; this value represents "Cons" (actually the most important information).
I should be able to split the data into 5 columns in order to proceed with the analysis. However, it should be a universal tool for different csv files without knowing the separator symbols in advance. The structure of the content and the heading is always the same.
I would be happy if anyone could recommend a way to work with this kind of data.
Sounds like what you are trying to do is convert the Cons column so that the space becomes a dot.
import pandas as pd
df = pd.read_csv("file.txt", sep=";")
df['Cons'] = df['Cons'].str.replace(r"\s+", ".", regex=True)
df['Cons'].head()
Output:
0 0.057
1 0.078
2 0.033
3 0.085
4 0.019
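A self-contained version of the same approach, with inline data mimicking the sample above (file name and data are illustrative), also converting the cleaned column to numbers so it can be used in analysis:

```python
import io
import pandas as pd

# Inline stand-in for one of the csv files (same structure as the sample).
raw = """CIP;Date;Hour;Cons;REAL/ESTIMATED
EN025140855608477018TC2L;11/03/2020;1;0 057;R
EN025140855608477018TC2L;11/03/2020;2;0 078;R
EN025140855608477018TC2L;11/03/2020;3;0 033;R"""

df = pd.read_csv(io.StringIO(raw), sep=";")

# Turn "0 057" into 0.057: replace the whitespace with a dot, then cast.
df["Cons"] = pd.to_numeric(df["Cons"].str.replace(r"\s+", ".", regex=True))
print(df["Cons"].tolist())  # [0.057, 0.078, 0.033]
```

Reading with sep=";" already splits the data into the 5 columns; the whitespace fix is only needed inside Cons.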

How to partition by in pandas and output to a word doc?

I have a table I have filtered from data; it contains my highlights from across the web. I want, ultimately, to output these to a doc file, grouped by the page they came from.
I have the api data filtered down to two columns
url|quote
How do I, for each url, output the quotes to a doc file? Or, just for starters, how do I iterate through the set of quotes for each url?
In SQL it would be something like this
SELECT quote OVER (PARTITION BY url) AS sub_header
FROM table
url quote
https://jotengine.com/transcriptions/WIUL8HBabqxffIDOkUA9Dg I actually think that the bigger problem is not necessarily having the ideas. I think everyone has lots of interesting ideas. I think the bigger problem is not killing the bad ideas fast enough. I have the most respect for the Codecademy founders in this respect. I think they tried 12 ideas in seven weeks or something like that, in the summer of YC.
https://jotengine.com/transcriptions/WIUL8HBabqxffIDOkUA9Dg We were like what the heck is going on here so we went and visited five of our largest customers in New York, this was about three years ago and we said okay, you're using the S3 integration but what the heck are you using it for? For five out of five customers in a row, they said well we have a data engineering team that's taking data from the S3 bucket, converting it into CS view files and managing all the schema-translations and now they're uploading it into a data warehouse like Redshift. The first time I heard that from a customer, I was like okay, that's interesting
I want to output a url header followed by all the quotes I've highlighted. Ideally my final product will be in docx
It would be great if you could provide some source code to help explain your problem. From looking at your question, I would say all you need to do is put your columns into a DataFrame, then export it to Excel.
df = pd.DataFrame({"url": url, "quote": quote})
df.to_excel("filename.xlsx")
Hope this helps.

VBA: Efficient Vlookup from another Workbook

I need to do a Vlookup from another workbook on about 400,000 cells with VBA. These cells are all in one column, and the results shall be written into one column. I already know how the Vlookup works, but my runtime is much too high when using autofill. Do you have a suggestion for how I can improve it?
Don't use VLookup; use Index Match: http://www.randomwok.com/excel/how-to-use-index-match/
If you are able to adjust what the data looks like a slight amount, you may be interested in using a binary search. It's been a while since I last used one (writing code for a group exercise check-in program), but https://www.khanacademy.org/computing/computer-science/algorithms/binary-search/a/implementing-binary-search-of-an-array was helpful in setting up the idea behind it.
If you are able to sort the data in some order, say by last name (I'm not sure what data you are working with), then add a column of ordered numbers to use for the binary search.
Edit:
The reasoning for a binary search is the computational time it takes. The number of iterations is log2(400000) vs 400000: instead of up to 400,000 iterations, it would take at most 19 with a binary search. As you can see, the more data you have, the bigger the speedup a binary search yields.
This would only be a beneficial way if you are able to manipulate the data in such a way that would allow you to use a binary search.
So, if you can give us a bit more background on what data you are using and any restrictions you have with that data we would be able to give more constructive feedback.
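The lookup idea in the answer above, sketched in Python rather than VBA (the keys and values are made up): repeatedly halve the search range over the sorted key column, which is what bounds each lookup at roughly log2(n) comparisons.

```python
def binary_lookup(keys, values, target):
    """Return the value matching target, searching sorted keys.

    With 400,000 sorted rows this needs at most ~19 comparisons
    per lookup, versus up to 400,000 for a linear scan.
    """
    lo, hi = 0, len(keys) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if keys[mid] == target:
            return values[mid]
        elif keys[mid] < target:
            lo = mid + 1   # target is in the upper half
        else:
            hi = mid - 1   # target is in the lower half
    return None  # not found

# Illustrative data: keys must be sorted for this to work.
keys = ["alice", "bob", "carol", "dave"]
values = [10, 20, 30, 40]
print(binary_lookup(keys, values, "carol"))  # 30
```

The same loop translates directly to VBA over a sorted worksheet column; the precondition is only that the lookup column is sorted.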

Way to optimise a mapping on informatica

I would like to optimise a mapping developed by one of my colleagues, where the "loading part" (into a flat file) is really, really slow - 12 rows per second.
Currently, just getting to the point where I start writing to my file takes about 2 hours, so I would like to know where I should start looking first; otherwise, I will need at least 2 hours between each improvement - which is not really efficient.
Ok, so to describe simply what is done :
Oracle table (with big query inside - takes about 2 hours to get a result)
SQ
2 LKup on ref table (should not be heavy)
update strategy
1 transformer
2 Lkup (on big tables - that should be one obvious optimisation point I guess: change them to Joiners)
6 stored procedures (these also seem a bit heavy, what do you think?)
another transformer
load in the flat file
Can you confirm that either the Lookup or the stored procedure part could be the reason why it is so slow?
Do you think I should look somewhere else to optimise? I was thinking maybe only 1 transformer.
First check the logs carefully. Look at the timestamps. That should give you an initial idea of which part causes the delay.
Lookups to big tables are not recommended. Joiners are a better way, but they still need to cache data. Can you limit the data for cache, perhaps? It'll be very hard to advise without seeing it.
Which leads us to the Stored Procedures: it's simply impossible to tell anything about them just like that.
So: first collect the stats and do the log analysis. Next, read some tuning guides on the net - there are plenty. Here's a comprehensive one, but, well... it's large - so you might prefer to look for some shorter ones.
Powercenter Performance Tuning Guide

How to delete multiple rows for wxDataViewCtrl and wxDataViewVirtualListModel

I am using wxDataViewCtrl and wxDataViewVirtualListModel to show a long list of data; the wxDataViewVirtualListModel has 3 wxArrayString members to store the data.
Currently, when I want to delete a row, I delete the data in the 3 wxArrayString members and call RowDeleted(row) to notify the wxDataViewCtrl.
However, when I want to delete hundreds of rows, I need to use a loop to delete them one by one, which is very slow.
How can I delete multiple rows faster?
Thank you
Sorry to dig up an old thread, but this pops to the top of the search, and I may have a solution that will help. The wxDataView example doesn't exactly show how to clear the entire list. Here is how I did it, and it seems very fast:
In your derived wxDataViewVirtualListModel class, add a function to clear all the column data out of your model. Like this:
void Clear() {
    m_myDescriptionColValues.clear();
    m_myNumberColValues.clear();
    m_myFooColValues.clear();
    Reset(0); // This is like DeleteRows(), but better.
}
In the wxDataView sample, this would go in the MyListModel class. Call this function when you want to clear out the model and repopulate the control with fresh data. It's really fast in my program with several hundred items.
At the very least, you should use a single RowsDeleted() call instead of multiple RowDeleted() calls. You could also use a more efficient representation than 3 parallel arrays, although I seriously doubt that's the bottleneck for just a few hundred rows -- but, as usual, you need to profile to find out whether this is really [not] the case.