Outline: Assignment 1 – Part B comprises three questions worth 15% of your final grade.
Total: 45 marks.
Instructions:
• Use only SAS in this assignment.
• Only documents in Portable Document Format (PDF) will be accepted. You can use, e.g., Word, knitr or Sweave to create your report, and RStudio as the editor of the source files.
• Formats other than PDF will be ignored and the author will be asked to re–submit the assignment within 24 hours after the due date & time at the cost of 5% of the total marks. If the assignment is not resubmitted within this time frame, then it will be assigned a mark of zero and deemed as non–submission.
• Any SAS code required to complete this assignment, especially the code to support your conclusions & answers, must be self-explanatory and must be embedded in the corresponding answer as text (not image). SAS code submitted in separate files will be ignored and not considered for marking.
• Optionally, you may submit only your answers and avoid copying & pasting each question in the PDF document. If this is the case, then just make reference to each question, e.g.,
Answer Question 1 (a), Answer Question 1 (b), … , etc.
• Read carefully – Answer all the questions as requested. Any material or information unrelated to the correct answer may result in a significant reduction of marks for that question.
• Several questions will come to light while solving these tasks. You may need to visit the SAS–support website for additional information about specific statements/steps to complete them.
• Don’t forget to fill in and sign the cover sheet which must be the very first page in the PDF.
Use, e.g., Adobe Acrobat Pro on Uni computers. Do not submit the cover sheet separately.
Question 1. The data set fitness.sas7bdat contains information from 30 women with low blood oxygen levels and/or difficulty breathing who took part in an experiment to test three new ‘full body workout routines’ initially designed to improve women’s ability to breathe. The data set is described in Table 1.
The 30 women were randomly allocated to the three routines, labelled as A, B and C. In practice, the higher the oxygen consumption, the more beneficial the routine is for lung and heart function. Previous research on the topic strongly suggests that oxygen consumption is (linearly) associated with body weight (variable ‘weight’). A 6-month membership for Workout B costs 2.5 times as much as a 6-month membership for Workout C. There is no fee to enrol in Workout A.
Table 1: Data set fitness.sas7bdat.
age = ‘Age in years’
weight = ‘Initial weight in kg’
runtime = ‘Minutes to run 1.5 miles’
rstpulse = ‘Heart rate while resting (BPM)’
runpulse = ‘Heart rate while running (BPM)’
maxpulse = ‘Maximum heart rate (BPM)’
oxygen = ‘Oxygen consumption in VO2 (Liters/minute)’
group = ‘Exercise workout Routine: A, B, or C.’
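Before any analysis, the data set in Table 1 has to be visible to your SAS session. The lines below are a minimal sketch of one way to do this, not a required approach. The folder path and the libref a1 are hypothetical and must be replaced with your own.

```sas
/* Assumption: fitness.sas7bdat has been saved in the folder below.   */
/* Both the path and the libref 'a1' are hypothetical placeholders.   */
libname a1 "/home/your-user/assignment1";

/* Print the first five observations to check that the variables     */
/* match the descriptions in Table 1.                                 */
proc print data=a1.fitness (obs=5);
run;
```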
You have been asked to analyse the data and prepare a short report addressing the following two concerns:
a) To determine if, on average, the 3 workout routines produce different oxygen consumption levels. If not, then recommend a workout routine that can be deemed as best and most beneficial towards improving women’s ability to breathe.
b) The extent to which the women’s (initial) weight influences or affects oxygen consumption across the three workout routines.
** In this question use a significance level of alpha = 10%. This means that the threshold to compare any p-value (from any model considered here) is 0.10 (not 0.05).
NOTE: Concerns a) and b) above must be addressed in Q1.4 below, as part of the Executive Summary. Don’t provide separate answers for a) and b). The questions that carry marks are Q1.1 to Q1.4 below.
1.- (1 mark + 3 marks justification) Which variables (out of Table 1) are involved in this analysis? Justify your answer.
2.- (1 mark + 3 marks justification) Which statistical model (technique) would you recommend to deal with the above concerns? Justify your answer.
3.- (2 marks) Conduct an appropriate analysis in SAS. Write down the code employed here (‘copy and paste’ is fine). NOTE: You will need to generate your own SAS code. You are welcome to take (and modify as required) the code provided in the lectures (NO marks will be deducted).
4.- (8 marks) Write down an Executive Summary for this question. Remember, it must include a short discussion on the approach adopted to analyze the data, your findings as well as their relevance towards points a) and b) above.
Question 2. PD–L1 is a protein that inhibits immune cells’ attacks on non–harmful cells in the body. Normally, our immune system fights foreign viruses or bacteria without touching our own healthy cells. Some cancer cells, however, may have high amounts of PD–L1 (this is called PD–L1 expression), such that the cancer cells are able to ‘trick’ our immune system and avoid being attacked (treated) as foreign, harmful substances.
Consider the data set BORdata.csv. It contains information from 500 patients with PD–L1 expression who were treated with anti–PD–L1 agents. The expression of PD–L1 was measured via the following two methods:
a) Method TC – Percentage of tumor cells, and
b) Method IC – Percentage of tumor infiltrating immune cells.
The outcomes from both methods are available in the BORdata.csv data set (columns IC and TC). A third method is based on the amount of antibodies observed in the patient at the beginning of the treatment. This result is given in the column named AB.
The Best Overall Response (BOR) is the best response recorded from the start of the treatment until disease progression/recurrence or otherwise. For each of our 500 patients, the outcome from the BOR, from among Method TC, Method IC and AB, is given in column BOR, coded as ‘1’ or ‘0’, i.e., effective (1) or ineffective (0). Here, effective means the disease did not recur in the patient. Otherwise, the outcome is labeled as ineffective.
Why are these results important? ** Because they will be used on new patients very soon to predict the efficacy of the BOR to detect the disease progression/recurrence solely based on methods TC, IC and AB **.
OUR AIM IN THIS QUESTION: Based on the results given in the data set, we aim to predict if a new patient will develop the disease without actually observing the disease progression/occurrence, but merely based on the results from the TC, IC and AB tests.
More specifically, every new patient (assumed to have PD–L1 expression) will undergo the new anti–PD–L1 agents treatment whose effect is measured with the three methods TC, IC, and AB. Our objectives are A) to determine the method or methods (among TC, IC and AB) that effectively classify these patients according to BOR, and B) the optimum threshold/s producing such result/s.
1.- (0 marks) Run an appropriate analysis in SAS addressing this concern. All the data (columns and rows, except perhaps the ‘Patient ID’) must be used. You will need to decide on, e.g., suitable statistical or machine learning methods. For instance, if you make use of regression models, then you need to state (and use) the ‘response’ variable and the ‘predictors’, etc. Likewise, if you employ ROC curves, then you will need to consider methods to compute optimum thresholds (Youden index or minimum–d), etc. Use SAS only.
2.- (15 marks) Write a professionally written research report (min 1 page, max 2 pages) outlining all the details of the conducted analysis. Include relevant figures, plots, etc. Secondary/supplementary material, as well as your correctly commented SAS code, MUST be presented in an Appendix. You can use the SAS snippet provided along with this Assignment – File: Question2.sas.
Also, discuss advantages as well as pitfalls or constraints with the approach adopted.
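For reference, the two threshold criteria named in item 1 are usually defined as follows. This is a sketch of the standard definitions, assuming that ‘minimum–d’ refers to the point on the ROC curve closest to the top-left corner (0, 1). The symbols c, Se and Sp are notation introduced here: c is a candidate threshold on one marker (TC, IC or AB), and Se(c) and Sp(c) are the sensitivity and specificity obtained when BOR is classified using that threshold.

```latex
% Youden index: maximise the sum of sensitivity and specificity
J(c) = \mathrm{Se}(c) + \mathrm{Sp}(c) - 1,
\qquad
c^{*}_{J} = \arg\max_{c} J(c)

% Minimum-d: minimise the distance from the ROC point to the corner (0, 1)
d(c) = \sqrt{\bigl(1 - \mathrm{Se}(c)\bigr)^{2} + \bigl(1 - \mathrm{Sp}(c)\bigr)^{2}},
\qquad
c^{*}_{d} = \arg\min_{c} d(c)
```

The two criteria need not select the same threshold, so state which one you used.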
Resources available for this question (highly recommended to read in advance):
R1. https://jitc.biomedcentral.com/articles/10.1186/s40425-019-0768-9
https://www.annalsofoncology.org/article/S0923-7534(19)60982-8/pdf
R2. SAS code File: Question2.sas.
R3. File ExampleQ2-BOR.pdf.
Question 3. The data set sashelp.cars contains 428 observations and 15 variables. For the following questions use SAS only.
**NOTE: DON’T employ ROC curves in this question.
1. (1 mark) Run PROC CONTENTS to learn about each variable of this data set. Present your code as answer to this question.
2. (3 marks) Using GLMs only, investigate the ability of variables MPG (city), WEIGHT (Lbs), and ENGINE SIZE to predict whether the car has its origin in Asia or not. Write a short paragraph (MAX 4 – 5 sentences) to address this point.
3. (3 marks) Using GAMs (generalized additive models) only, investigate the ability of variables MPG (city), WEIGHT (Lbs), and ENGINE SIZE to predict whether the car has its origin in Asia or not. Write a short paragraph (MAX 4 – 5 sentences) to address this point.
4. (5 marks) Write down a short report (MAX 8 – 12 sentences) comparing the results from GLMs and GAMs. Discuss the advantages and/or limitations of each modeling framework. Finally, draw conclusions on whether GLMs or GAMs should be considered to predict the origin of a car (binary origin – Asian or Not-from-Asia).
Any piece of SAS code used in this question must be presented in an Appendix, as well as supplementary material.
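As a reminder of what distinguishes the two frameworks for a binary response, the general model forms can be sketched as below. The notation is introduced here and is not part of the data set: Y = 1 if the car's origin is Asia and 0 otherwise, p = P(Y = 1 | x), and x_1, x_2, x_3 denote MPG (city), weight and engine size. The logit link is an assumption made for illustration; other links are possible.

```latex
% GLM (logit link assumed): linear in each predictor
\log\frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3

% GAM (logit link assumed): each predictor enters through a smooth function
\log\frac{p}{1 - p} = \beta_0 + f_1(x_1) + f_2(x_2) + f_3(x_3)
```

Here each f_j is a smooth function estimated from the data, so the GLM is the special case in which every f_j is a straight line.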