Getting summaries by subgroups of your data {plyr}
highest profit margin (as point data)
highestMargin <- max(companiesData$margin)
highestMargin <- companiesData[companiesData$margin == max(companiesData$margin),]
highestMargin <- subset(companiesData, margin==max(margin))
the highest margin (as dataframe)
- NULL as the second argument for factors to split by
highestProfitMargin <- ddply(companiesData, NULL, summarize, bestMargin = max(margin))
the highest profit margin for each company? (add new column(s))
myResults <- ddply(companiesData, 'company', transform, highestMargin = max(margin), lowestMargin = min(margin))
highestProfitMargins <- ddply(companiesData, 'company', summarize, bestMargin = max(margin))
highestProfitMargins <- ddply(companiesData, 'company', transform, bestMargin = max(margin))
just the (entire) rows that have the highest profit margins.
highestProfitMargins <- ddply(companiesData, 'company', function(x) x[x$margin==max(x$margin),])
plyr::ddply(): Input a data frame and get a data frame back.
ddply(mydata, c('column name of a factor to group by', 'column name of the second factor to group by'), summarize OR transform, newcolumn = myfunction(column name(s) I want the function to act upon))
- the first argument
- the name of the original data frame
- the second argument
- the name of the column or columns you want to subset your data by.
- No c() wrapper is needed for just one factor if you're using quotation marks
- NULL
- treats the entire data set as a single group, e.g. to find the highest margin in the entire data set
- The third argument
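- tells ddply() what kind of result to return
- summarize
- returns one row per group, with only the grouping column(s) and the new column(s); it doesn't give any information from other columns in the original data frame
- transform
- returns your existing data frame with the new column(s) added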
- the fourth argument
- names the new column and then
- lists the function you want ddply() to use.
highest profit margin
highestMargin <- max(companiesData$margin)
highestMargin <- companiesData[companiesData$margin == max(companiesData$margin),]
highestMargin <- subset(companiesData, margin==max(margin))
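The three one-liners above do not return the same kind of object. A minimal sketch, assuming a small made-up companiesData (the company names and margin values here are hypothetical, not the original data):

```r
# Hypothetical sample data, for illustration only
companiesData <- data.frame(
  fy      = c(2010, 2011, 2012, 2010, 2011, 2012),
  company = c("A", "A", "A", "B", "B", "B"),
  margin  = c(21.5, 23.9, 26.7, 29.0, 25.7, 21.4)
)

# 1) max() alone returns a single number
highestMargin <- max(companiesData$margin)

# 2) logical row indexing returns the full row(s) as a data frame
topRows1 <- companiesData[companiesData$margin == max(companiesData$margin), ]

# 3) subset() returns the same row(s) with less typing
topRows2 <- subset(companiesData, margin == max(margin))

class(highestMargin)  # "numeric"
class(topRows1)       # "data.frame"
```

The first form is enough when only the number matters; the other two keep the year and company that go with it.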
the highest profit margin for each company?
- applying a function by groups
- what R calls factors.
- "split-apply-combine" (Hadley Wickham {plyr})
- Split up your data set by one or more factors,
- apply some function,
- then combine the results back into a data set.
- plyr package created by Hadley Wickham
- group of "ply" functions
- ddply(): Input a data frame and get a data frame back.
- alply(): input an array and get back a list
- ldply(): input a list and get back a data frame
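The split-apply-combine steps listed above can be sketched in base R, without plyr, to show what ddply() does in one call. This is only an illustration under assumed data (companiesData below is hypothetical), not how plyr is implemented:

```r
# Hypothetical sample data, for illustration only
companiesData <- data.frame(
  fy      = c(2010, 2011, 2012, 2010, 2011, 2012),
  company = c("A", "A", "A", "B", "B", "B"),
  margin  = c(21.5, 23.9, 26.7, 29.0, 25.7, 21.4)
)

# Split: one data frame per company
pieces <- split(companiesData, companiesData$company)

# Apply: compute the best margin within each piece
best <- lapply(pieces, function(x) {
  data.frame(company = x$company[1], bestMargin = max(x$margin))
})

# Combine: stack the per-company results back into one data frame
result <- do.call(rbind, best)
result
```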
install & load
install.packages("plyr")
library("plyr")
format of plyr::ddply() 1: names in quotes
ddply(mydata, c('column name of a factor to group by', 'column name of the second factor to group by'), summarize OR transform, newcolumn = myfunction(column name(s) I want the function to act upon))
- the first argument
- the name of the original data frame
- the second argument
- the name of the column or columns you want to subset your data by.
- No c() wrapper is needed for just one factor if you're using quotation marks
- NULL
- treats the entire data set as a single group, e.g. to find the highest margin in the entire data set
- The third argument
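- tells ddply() what kind of result to return
- summarize
- returns one row per group, with only the grouping column(s) and the new column(s); it doesn't give any information from other columns in the original data frame
- transform
- returns your existing data frame with the new column(s) added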
- the fourth argument
- names the new column and then
- lists the function you want ddply() to use.
highestProfitMargins <- ddply(companiesData, 'company', summarize, bestMargin = max(margin))
highestProfitMargins <- ddply(companiesData, 'company', transform, bestMargin = max(margin))
- apply more than one function at a time
myResults <- ddply(companiesData, 'company', transform, highestMargin = max(margin), lowestMargin = min(margin))
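To see the difference between summarize and transform, compare the shape of what each returns. A minimal sketch, assuming plyr is installed and using a made-up companiesData (hypothetical values):

```r
library(plyr)

# Hypothetical sample data, for illustration only
companiesData <- data.frame(
  fy      = c(2010, 2011, 2012, 2010, 2011, 2012),
  company = c("A", "A", "A", "B", "B", "B"),
  margin  = c(21.5, 23.9, 26.7, 29.0, 25.7, 21.4)
)

# summarize: one row per company, only company + the new column
bySummarize <- ddply(companiesData, "company", summarize,
                     bestMargin = max(margin))

# transform: every original row and column, plus the new column
byTransform <- ddply(companiesData, "company", transform,
                     bestMargin = max(margin))

# More than one function at a time works with summarize as well
bothEnds <- ddply(companiesData, "company", summarize,
                  highestMargin = max(margin), lowestMargin = min(margin))

dim(bySummarize)
dim(byTransform)
```

With this sample, bySummarize has one row per company while byTransform keeps all six original rows.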
- just the (entire) rows that have the highest profit margins.
highestProfitMargins <- ddply(companiesData, 'company', function(x) x[x$margin==max(x$margin),])
function(x)
an anonymous (unnamed, ad hoc) function is coming next
- extracting a subset of x
- x refers to the data frame that was passed into the anonymous function
- Because the anonymous function is passed into a ddply() statement that splits the data frame by company, what's returned is the matching row(s) for each company.
[x$margin==max(x$margin),]
- I want to match every row where x$margin equals the maximum of x$margin.
- The comma after x$margin==max(x$margin)
- tells R to return every column of those matching rows, since no columns were specified.
- As an alternative, we could seek to return only one or several of the columns instead of all of them.
companiesData[companiesData$margin==max(companiesData$margin),]
- alone, without ddply(), gives the row(s) with the highest overall margin, not the highest margin for each company.
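The row-subsetting idea above, including the alternative of returning only some columns, can be sketched as follows. This assumes plyr is installed; companiesData is made up, and the column choice c("fy", "margin") is just an example:

```r
library(plyr)

# Hypothetical sample data, for illustration only
companiesData <- data.frame(
  fy      = c(2010, 2011, 2012, 2010, 2011, 2012),
  company = c("A", "A", "A", "B", "B", "B"),
  margin  = c(21.5, 23.9, 26.7, 29.0, 25.7, 21.4)
)

# Entire rows with the best margin for each company
topRows <- ddply(companiesData, "company",
                 function(x) x[x$margin == max(x$margin), ])

# Same rows, but only selected columns (named after the comma)
topYears <- ddply(companiesData, "company",
                  function(x) x[x$margin == max(x$margin), c("fy", "margin")])

# Without ddply(): the best row(s) across the whole data set
overallTop <- companiesData[companiesData$margin == max(companiesData$margin), ]
```

Here topRows holds one row per company, while overallTop holds only the single best row overall. In topYears, ddply() should still label each row with its company, since that is the variable it split by.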
the highest margin in the entire data set
- NULL as the second argument for factors to split by
highestProfitMargin <- ddply(companiesData, NULL, summarize, bestMargin = max(margin))
format of plyr::ddply() 2: dot before the column names
myresult <- ddply(mydata, .(column name of factor I'm splitting by, column name second factor I'm splitting by), summarize OR transform, newcolumn = myfunction(column name I want the function to act upon))
- To get the highest profit margins for each company
- splitting the data frame by only one factor
- company.
- Even if you've only got one factor,
- it needs to be in parentheses after the dot if you're using the dot notation
highestProfitMargins <- ddply(companiesData, .(company), summarize, bestMargin = max(margin))
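The quoted and dot formats should give the same result for the same grouping. A small sketch, assuming plyr is installed and using made-up data (hypothetical values), that also shows two factors inside .():

```r
library(plyr)

# Hypothetical sample data, for illustration only
companiesData <- data.frame(
  fy      = c(2010, 2011, 2012, 2010, 2011, 2012),
  company = c("A", "A", "A", "B", "B", "B"),
  margin  = c(21.5, 23.9, 26.7, 29.0, 25.7, 21.4)
)

# Format 1: column name in quotes
quoted <- ddply(companiesData, "company", summarize, bestMargin = max(margin))

# Format 2: unquoted column name inside .()
dotted <- ddply(companiesData, .(company), summarize, bestMargin = max(margin))

# Two factors: unquoted names separated by a comma inside .()
byBoth <- ddply(companiesData, .(company, fy), summarize, bestMargin = max(margin))

all.equal(quoted, dotted)
```

In this sample every company-year pair is unique, so byBoth has one row per original row.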