Importance of a Comparison Group and a Long-Term Follow-Up Test in Evaluating Environmental Education Experiences

This study evaluated the impacts on environmental literacy after a non-formal science-based program and compared the impacts to a non-formal non-science-based program. Both programs included children in grades six to eleven (ages 11 to 17) from the Syracuse, New York, USA area. Environmental literacy was assessed by administering environmental attitude and environmental knowledge pre-, post-, and follow-up tests to both programs' participants. Initially, environmental attitude scores were higher for the participants in the science-based program. However, this was not a lasting impact. According to the follow-up test, attitude scores were not elevated for the science-based program. Without the follow-up tests given weeks after the program's end, we could have inferred that environmental attitudes were increased by the science-based program. Environmental knowledge was higher at the end of the science-based program but also increased in the comparison group. The gains in environmental knowledge were sustained for several weeks, but differences between the two programs did not persist. Without the comparison group, we could have inferred that environmental knowledge increased solely due to the science-based program. These results show that incorporating both a comparison group and a follow-up assessment is necessary to properly evaluate the effectiveness of science-based programs in increasing environmental literacy.

Environmental education can inform people of these issues, what they can do to help, and why the environment is important (Bogner, 1998; Dehart Hurd, 1958; Pooley & O'Connor, 2000; Tucker & Izadpanahi, 2017). Environmental education comes in many forms, from formal school curricula to informal, spontaneous interactions (UNESCO, 1975; UNESCO & UNEP, 1978). Because of the lack of adoption of environmental topics in schools, non-formal experiences are inordinately important for increasing environmental literacy (Ozdemir, 2010).
Although non-formal education experiences are structured, they are also voluntary (UNESCO, 1993) and frequently occur in informal settings outside the classroom. These can come in many forms, such as structured talks at a zoo, ranger programs in a park, or a science camp during the summer (UNESCO, 1993).
This study focused on a non-formal science summer program run by undergraduate students from the State University of New York College of Environmental Science and Forestry (ESF). This program, Summer Camps Investigating Ecology in Neighborhood and City Environments (called SCIENCE), was comprised of participants from the Syracuse area who ranged in age from 11 to 17.

Current State of Environmental Education Field
Though environmental education is not a new field, it is expanding quickly in its practice and theoretical construction (Ardoin et al., 2017). To have the most impact, environmental educators must use all paths available to them: advocating for policy requiring environmental education, focusing on improving attitude and behavior as well as knowledge, and utilizing non-formal education experiences (Ardoin et al., 2017; Bischoff et al., 2008). While there has been an increasing number of studies on environmental education and environmental literacy conducted around the world, especially within the past five years (Ardoin et al., 2017), gaps remain in this body of research. Most studies focus on middle school students while assessing outcomes within a short time frame (six months or less) with no follow-up investigation to determine long-term effects (Ardoin et al., 2017). Sweeping changes in opinions or shifts in career aspirations are unlikely during the short time spans of these programs, but most claim an increase in environmental knowledge by the end of the program (Antink-Meyer et al., 2016; Bhattacharyya et al., 2011). Testing participants directly after a program almost ensures an increase in knowledge will be found (Bogner, 1998). The lack of follow-up in the majority of studies prevents evaluation of any lasting impact of these programs (Bogner, 1998; Redman & Redman, 2016).
The use of a comparison group is another study design component that is frequently missing from environmental education evaluations (Long, 2014; Paço & Lavrador, 2017). Sanacora (2017) explained how the lack of a comparison group in medical research lessens the impact of findings and prevents the discovery of confounding factors. This is true for environmental education research as well; comparison groups are necessary to help distinguish the effects of interventions, such as treatments to improve environmental literacy, from the effects of prior knowledge, skills, or attitudes (Sanacora, 2017).
Comparison group scores on pre-tests, post-tests, and follow-up tests can be compared and analyzed alongside the scores of participants attending the science program (Sanacora, 2017). Without data from a comparison group, gains in environmental literacy cannot be definitively linked to the program studied (Long, 2014; Sanacora, 2017) since outside factors or flaws in experimental design can go undetected (Sanacora, 2017). A true control group is usually not possible in environmental education studies since participants tend to choose their educational experience rather than be randomly assigned to one (Long, 2014). Still, the comparison group can alleviate many problems in data collection and interpretation of results (Sanacora, 2017).

Study Objectives
The objective of this study was to determine the importance of including (a) a comparison group and (b) a follow-up assessment in evaluating outcomes of an environmental education program.
We carried out this study in the context of asking a familiar type of education research question: "Does participation in a non-formal youth summer science program increase participants' knowledge and attitudes about the environment?" We addressed this question with particular consideration for how the inclusion of a comparison group might influence interpretation of our data, and how inclusion of a follow-up test might temper the magnitude of knowledge and behavior gains that we would attribute to the educational intervention.

Study Group Selection
We partnered with an existing summer youth science program that is administered by ESF in collaboration with the Syracuse City School District (SCSD) and a local non-profit organization. The program, SCIENCE, provides a one-week-long environmental science education experience to SCSD students during the summer. Educational modules are taught by ESF undergraduate students studying in a variety of environmental fields. All methods were reviewed and approved by the Institutional Review Board of Syracuse University.
Student participants were all residents of the City of Syracuse, New York, who ranged in age from 11-17 years, were enrolled in grades 6-11, and were drawn from one of the following programs or schools.

1. Father Champlin's Guardian Angel Society (Guardian Angels) is a Catholic charity that runs a free summer-long education camp for Syracuse students. Student participants are US-born children of immigrants attending SCSD schools. Children are typically enrolled to provide supervision and meals during the workday when schools are out of session.
2. Expeditionary Learning Middle School (ELMS) is a Syracuse public school that draws students from across SCSD. The school serves students who need more attention because of a learning disability, problems with bullying, or behavior issues. ELMS has a mandatory orientation for its incoming sixth grade class. The first week of orientation involves getting to know the school, its schedule, teachers, and team building. No instruction occurs. The second week of orientation is run by SCIENCE.
3. Syracuse Academy of Science (SAS) is a Syracuse charter school that has a specialized STEM-focused curriculum and yields graduation rates and test scores that surpass those of SCSD and the New York statewide average. SAS runs a summer camp for students, one week of which is run by SCIENCE. Again, children are typically enrolled in the summer camp to provide supervised activities and meals during the work week.
SCIENCE participants were not self-selecting. They attended as part of another summer program that was not science focused. Most ELMS students were required to attend by their school, and the Guardian Angels and SAS students were enrolled by parents or guardians. Therefore, it was not a concern that SCIENCE participants would have a pre-existing higher level of environmental literacy.

Comparison Group Selection
We structured the comparison group to include children who were participating in a similar supervised summertime recreation program, but one that did not include a component of outdoor science education administered by SCIENCE. The comparison group was comprised of children attending the Town of Onondaga Parks and Recreation "Playgrounds" program. Playgrounds has been organized each summer for over 30 years. It is a recreation program without an education focus of any kind. It does not teach science or have any educational mandate. The object of Playgrounds is to let kids have fun. Activities range from playing games outside to arts and crafts projects. Unlike SCIENCE, whose instructors were ESF undergraduates studying science, Playgrounds' counselors were local teenagers who did not have science or education backgrounds. Though it is impossible to say no science topics were talked about during Playgrounds, the counselors did not have any expressed interests in science, the daily activities did not include science learning, and the kids were focused on sports and crafts. Most of the participants come from the SCSD. The children at Playgrounds are between 6 and 17 years of age, but only those 10 and older were included in this study. Playgrounds ran throughout the summer, but only one week was sampled since the same children came each week. It is impossible to say no informal science education happened during the week sampled. However, we are certain Playgrounds' focus was not on science education, and it is unlikely the topics covered in SCIENCE's curriculum and the environmental literacy tests we created were discussed at length.

Survey Instrument
Environmental literacy (EL) tests were created (Appendix A) with two sections, one to test environmental attitude (EA) and one to test for environmental knowledge (EK). The EA portion was the same for all age groups and throughout the study. Since the EA portion was to measure changes in attitude and behavior it was easy to create questions understandable by the wide age range of participants, negating the need for two grade levels. Since answers are not "right" and "wrong" and needed to be compared for changes, analyses were more consistent with only one version. The first set of EA questions were answered using a Likert scale. Participants could select least interested or likely (0) to most interested or likely (4).
The EK tests, however, were prepared for two age classes: sixth to eighth grade and ninth to eleventh grade. It was decided that questions appropriately challenging for ninth graders would be too difficult for sixth graders. The two levels had similar questions, but these were phrased differently, with increased levels of terminology or depth of knowledge included in the 9th-11th grade tests. The EK test also needed different versions for the pre-, post-, and follow-up rounds to prevent participants from remembering questions from previous rounds, as they could have discussed or looked up the answers before the next test. All versions had ten questions and were scored as right or wrong, with participants' EK score being the percentage of correct answers. SCIENCE shared a schedule of a past camp to help determine test question topics. Once the curriculum for the 2017 camp was solidified, the rough draft of the EL test was given to the SCIENCE coordinator and counselors for review. They gave updated feedback so the questions best reflected what would be covered in the camp. Questions were not pulled directly from curriculum activities, nor were counselors instructed to teach what was on the test. The input was only to ensure topics mentioned on the EL test aligned with lessons during the camp.
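As a minimal sketch of the scoring rule described above (ten items, each marked right or wrong, with the EK score being the share answered correctly), the following Python function returns a score as a proportion. The function name and the example marks are hypothetical and are not taken from the study data; Python is used only for illustration.

```python
def ek_score(graded_items):
    """Return the proportion of items marked correct (0.0 to 1.0).

    graded_items is a list of booleans, one per question, where True
    means the grader marked the answer as right.
    """
    if not graded_items:
        raise ValueError("at least one graded item is required")
    return sum(1 for is_correct in graded_items if is_correct) / len(graded_items)


# Hypothetical marks for one participant on a ten-question test.
example_marks = [True, False, True, True, False, True, True, False, True, True]
example_score = ek_score(example_marks)
print(f"EK score: {example_score:.2f}")
```

Storing the score as a proportion rather than a percentage keeps it on the 0.0 to 1.0 scale used when the EK means are reported.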

Study Participant Selection
Pre-tests and post-tests were collected from children who participated in three of the six SCIENCE programs during summer 2017. Each program lasted one week, with instruction occurring from 9:00 am to 3:00 pm. Playgrounds ran from 8:00 am to 4:00 pm at three different locations. Pre-tests and post-tests were collected from children at each location during one week of the summer of 2017. Pre-tests were administered by the researcher on site the morning of the first day. Post-tests were administered by the researcher on site the afternoon of the last day. Follow-up tests were administered after the start of the following academic year (Table 1). Follow-up tests for Guardian Angels and Playgrounds were mailed out three months after the program week and had to be returned within a month. We administered follow-up tests for ELMS and SAS at the respective schools three months after the program week. Each mailed follow-up test contained a link to an identical online version, a paper version of the test, and a self-addressed stamped envelope for the test to be mailed back to the researcher. No participant filled out the online version; therefore, all tests were completed on paper.

Data Collection
After each week, all the survey answers were entered into an Excel spreadsheet. EA and EK tests were kept paired for each participant. Participants created codes based on their initials and the year they were born (for example, Joan Gabrielle Smith born in 2007 would be JGS07). This allowed participants' pre-, post-, and follow-up tests to be matched while maintaining confidentiality. Answers were identified by this code on all research material.
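The coding scheme above can be sketched in a few lines of Python. This is an illustration only: the study describes participants writing their own codes by hand, and the helper names, the dictionary layout, and the scores below are all hypothetical.

```python
def participant_code(full_name, birth_year):
    """Build a code from the initials plus the last two digits of the birth year."""
    initials = "".join(part[0].upper() for part in full_name.split())
    return f"{initials}{birth_year % 100:02d}"


def match_rounds(pre, post, follow_up):
    """Pair scores across the three test rounds by participant code.

    Each argument maps code -> score. A participant missing from a round
    gets None for that round, so incomplete sets remain visible.
    """
    codes = sorted(set(pre) | set(post) | set(follow_up))
    return {code: (pre.get(code), post.get(code), follow_up.get(code)) for code in codes}


# The example given in the text.
print(participant_code("Joan Gabrielle Smith", 2007))

# Hypothetical scores, invented for illustration only.
matched = match_rounds(
    pre={"JGS07": 0.4, "ABC05": 0.5},
    post={"JGS07": 0.7, "ABC05": 0.6},
    follow_up={"JGS07": 0.6},
)
print(matched)
```

One limitation of any scheme of this kind is that two participants with the same initials and birth year would share a code, so a real matching step would need a way to flag duplicates.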

Data Analysis
Summary statistics (average and standard error) of EA and EK scores were generated for the SCIENCE and Playgrounds groups for the pre-test, post-test, and follow-up test. For EK scores, an unbalanced model I, 2×3 factorial ANOVA with an interaction term, run in R, was used to test for differences between the means of the pre-, post-, and follow-up tests for SCIENCE and Playgrounds. To see if the data satisfied the assumptions of the test and were normally distributed, quantile-quantile plots were made for each group. For the EA, Likert scores (0-4) served as the response variable. Test type and participant group were the predictors. An ANOVA with an interaction term was fitted in R, and simple effects were then tested. Interaction plots were made to show the mean scores for both programs over the three tests.
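The summary statistics step can be illustrated with a short sketch. The analysis itself was run in R; the Python below is a stand-in that only computes the mean and standard error for each group-by-test cell of the 2×3 layout and does not reproduce the ANOVA. The function names and every score are hypothetical.

```python
import math
import statistics
from collections import defaultdict


def mean_and_se(scores):
    """Return (mean, standard error) for a list of at least two scores."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    mean = statistics.fmean(scores)
    # Standard error = sample standard deviation / sqrt(n).
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, se


def cell_summaries(records):
    """Group (group, test, score) records into cells and summarize each cell."""
    cells = defaultdict(list)
    for group, test, score in records:
        cells[(group, test)].append(score)
    return {cell: mean_and_se(scores) for cell, scores in cells.items()}


# Hypothetical scores, invented for illustration only (not study data).
records = [
    ("SCIENCE", "pre", 0.4), ("SCIENCE", "pre", 0.5), ("SCIENCE", "pre", 0.3),
    ("SCIENCE", "pre", 0.6), ("SCIENCE", "pre", 0.4),
    ("Playgrounds", "pre", 0.5), ("Playgrounds", "pre", 0.3),
]
summary = cell_summaries(records)
for (group, test), (mean, se) in summary.items():
    print(f"{group} {test}: {mean:.2f} ± {se:.2f}")
```

Because the cells can hold different numbers of participants, each cell is summarized on its own, which mirrors the unbalanced design described above.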

Environmental Attitude
The ANOVA revealed a significant main effect between the two participant groups, with the average SCIENCE EA scores being 10% higher than Playgrounds (Table 2). The SCIENCE EA score was higher independent of the test type. An interaction was detected between test type and participant group, so we analyzed the simple effect between participant groups for each test type. Our data indicate that no difference existed in environmental attitudes between the groups during the pre-test and that, although the post-test scores suggested a greater increase in environmental attitude that might be attributed to the SCIENCE program, these differences disappeared by the time participants took the follow-up test (Figure 1).
On the EA portion of the tests, participants answered questions on a Likert scale of 0-4 with 0 indicating 'least interested' and 4 indicating 'most interested' and their individual EA score was the average of their Likert responses to twelve questions. SCIENCE participants had a pre-test EA score mean (±1 SE) of 2.6±0.1, a post-test mean of 2.7±0.1, and a follow-up test mean of 2.6±0.1. Playground participants had a pre-test EA score mean (±1 SE) of 2.3±0.1, a post-test mean of 2.3±0.1, and a follow-up test mean of 2.6±0.2 (Figure 1). In Figure 1, data are EA score means (±1 SE) for the three test types associated with the two participant groups (SCIENCE and Playgrounds); points are connected to show the trend within the group; p-values next to the means report the simple effects of participant groups for each test; and to better show the data trends, the full y-axis (0-4) is not displayed.
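The EA scoring rule described above (the average of twelve Likert responses, each from 0 to 4) can be sketched as follows. The function name, the validation choices, and the example responses are assumptions made for illustration and are not drawn from the study.

```python
def ea_score(responses, n_items=12, low=0, high=4):
    """Return the mean of Likert responses after checking count and range."""
    if len(responses) != n_items:
        raise ValueError(f"expected {n_items} responses, got {len(responses)}")
    for response in responses:
        if not isinstance(response, int) or not low <= response <= high:
            raise ValueError(f"response {response!r} is outside the {low}-{high} scale")
    return sum(responses) / n_items


# Hypothetical responses for one participant (0 = least, 4 = most interested).
example_responses = [3, 2, 4, 3, 2, 3, 1, 4, 3, 2, 3, 2]
example_ea = ea_score(example_responses)
print(f"EA score: {example_ea:.2f}")
```

Rejecting incomplete or out-of-range answer sheets is one possible design choice; the text does not say how skipped items were handled.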

Environmental Knowledge
On the EK portion of the tests, participants were scored on the proportion of ten questions that were answered correctly. The EK scores for SCIENCE participants averaged (±1 SE) 0.43±0.02 on the pre-test, 0.69±0.02 on the post-test, and 0.63±0.02 on the follow-up test. The Playgrounds EK scores averaged 0.48±0.02, 0.62±0.02, and 0.73±0.03 for the pre-, post-, and follow-up tests, respectively.
The 2×3 factorial ANOVA (Table 3) detected a main effect of test type and an interaction between test type × group. The EK post-tests and follow-up tests for both groups were 50% higher than the pre-tests, independent of participant group. Participants gained knowledge regardless of instruction. However, SCIENCE scores increased from pre-test to post-test, then decreased partially for the follow-up test. Playground scores increased from pre-test to post-test and post-test to follow-up test.  Although an interaction was detected between test type and participant group, none of the simple effects between participant groups for each test type were significant. The simple effects between the pre-test and post-test were significant for both SCIENCE (p<0.001) and Playgrounds (p=0.01). Also, the simple effects between the pre-test and follow-up test for both SCIENCE (p<0.001) and Playgrounds (p<0.001) were significant. Figure 2 reports the interaction between test type and participant groups for EK scores. In Figure 2, data are EK score means (±1 SE) for the three test types associated with the participant groups (SCIENCE and Playgrounds); points are connected to show the trend within the group; p-values next to the means report the simple effects of participant groups for each test; and to better show the data trends, the full y-axis (0.0-1.0) is not displayed.
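As a rough arithmetic check on the size of the increase, the reported cell means can be averaged and compared. This sketch assumes unweighted averages across groups, because cell sample sizes are not given here, so it will not exactly match a figure derived from the fitted model. The variable and function names are hypothetical.

```python
def relative_increase(before, after):
    """Return the fractional change from before to after."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (after - before) / before


# Reported EK means: pre-test, then (post-test, follow-up test).
pre_means = {"SCIENCE": 0.43, "Playgrounds": 0.48}
later_means = {"SCIENCE": (0.69, 0.63), "Playgrounds": (0.62, 0.73)}

# Unweighted averages (an assumption; cell sizes are not available here).
pre_avg = sum(pre_means.values()) / len(pre_means)
later_values = [mean for pair in later_means.values() for mean in pair]
later_avg = sum(later_values) / len(later_values)

overall = relative_increase(pre_avg, later_avg)
print(f"pre {pre_avg:.3f}, later {later_avg:.4f}, increase {overall:.1%}")
```

Under this assumption the increase comes out a little below one half, in the same range as the approximately 50% stated in the text.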

DISCUSSION
The results were not expected and revealed some interesting trends within the data. Some study limitations could have impacted the results and should be considered for future studies. Even with the unexpected outcomes, conclusions can be drawn and there are recommendations for future evaluation studies.

Environmental Attitude
These results show that the environmental attitude post-test scores of SCIENCE participants were not higher than their pre-test scores, and their post-test scores were not significantly different from their follow-up scores. The EA scores for SCIENCE were not different across the three tests. The SCIENCE program did not increase the EA for participants. A short timeframe limits the impact on environmental attitude (Antink-Meyer et al., 2016). Antink-Meyer et al. (2016) found that influencing attitude and behavior changes should be long-term goals and that even small increases in environmental attitude are encouraging. SCIENCE participants did have an elevated EA before the program when compared to the comparison group. Playgrounds EA scores also did not increase across the three tests, which was expected since it was not a science-focused program. The test for simple effects, in Table 2 and illustrated in Figure 1, did show a difference between SCIENCE and Playgrounds' post-test scores. SCIENCE participants' attitude increased more throughout the week than that of Playgrounds participants; however, this gain was erased by the time the follow-up tests were administered.
The EA scores were different between SCIENCE and Playgrounds, as seen in Table 2 and Figure 1. SCIENCE had a higher EA score than Playgrounds at the beginning of the study, indicating SCIENCE participants came to the program with an elevated EA compared to Playgrounds' participants. This pre-existing difference was unexpected since participants of neither group self-select. SCIENCE is an environmental camp; however, most participants do not choose to go because of an interest in environmental science. The three partner organizations hired SCIENCE as part of larger summer camp programming that participants are enrolled in by parents out of need or requirement.
Note (table): test type refers to pre-, post-, and follow-up, and participant group to SCIENCE and Playgrounds. Test:Group is the interaction between those two categories.

Figure 2. Interaction plot for the EK pre-, post-, and follow-up test means of SCIENCE and Playgrounds
Only the SAS participants had a choice to attend the SCIENCE summer program. They accounted for about 20% of the total participants, so it is possible they influenced the pre-test estimate if they possessed a higher pre-existing environmental attitude. SAS is also a science-focused school, and its students could be more interested in science than the general SCSD school population. To see if SAS was skewing the EA score, it was separated from the SCIENCE participant group into its own group and the ANOVA was redone. This analysis revealed that SAS actually had a lower EA average than the rest of the SCIENCE participants and therefore did not positively skew the SCIENCE EA score. Despite being the only science-inclined and voluntary group, SAS participants do not explain the elevated EA score.
Guardian Angels was focused on science learning during the 2017 summer. However, the focus changes each summer, and advertising did not stress it was science-based. The four participants from Guardian Angels in this study had a lower EA average score than ELMS, indicating Guardian Angels also did not positively skew the SCIENCE EA score.
The largest subset of SCIENCE participants, ELMS, requires incoming students to attend their orientation, which includes a week of the SCIENCE program. ELMS is not a science magnet school, and it does not do any preparatory teaching before SCIENCE that would influence environmental literacy. Even so, the ELMS participants must account for the elevated pre-test scores in environmental attitude. Having a comparison group exposed the elevated pre-test scores and could indicate that a change in methodology is needed to prevent this in the future. Pooley and O'Connor (2000) found it difficult to study environmental attitude because it is hard to quantify or qualify. Despite the difficulties, environmental attitude is essential to understanding how to influence behavior (Manoli et al., 2007). Knowledge alone does not affect behavior, and the two in tandem must be studied to effect change (Pothitou et al., 2016).

Environmental Knowledge
SCIENCE and Playgrounds EK scores both increased by the end of the program and stayed higher into the school year. However, there was no difference between the participant group scores at each test (Table 3 and Figure 2) even when the simple effects were examined. Both participant groups' test scores increased over time despite only SCIENCE participants getting environmental programming. SCIENCE participants' EK did increase at the end of the program. This elevated EK did not diminish months after the program; it stayed at the higher level. However, the comparison group had similar results. Playgrounds scores increased from pre-test to post-test. Their follow-up test score was also higher than the pre-test, but not different than the post-test. This makes it impossible to state the SCIENCE program was what increased participants' EK.
The testing effect could have increased the post-test scores (Hartley, 1973). The testing effect is an increase in future test performance due to exposure to previous, similar testing (Hartley, 1973). The potential for the testing effect to increase both groups' scores is high. Hartley (1973) explained that giving a pre-test influences the performance on a post-test, but acknowledged it is an ongoing debate. Even with changing the test questions each time, participants are still more familiar with the structure of the test and the type of questions asked (Hartley, 1973). More recently, Kromann et al. (2009) studied how assessing skills helped develop them and found testing did increase knowledge of the material. The follow-up test scores could have been influenced by both the testing effect and the start of the school year. Being back in the school environment could have increased performance, since participants might have taken the test more seriously in a school setting (von der Embse & Hasson, 2012). The percentage of open-ended responses was the highest for ELMS, who took the follow-up test during school hours, administered by teachers, in the classroom. This could indicate the test was taken more seriously.
These outside factors would not have been investigated without the EK scores of the comparison group. Looking at SCIENCE participant scores alone, the results would have been as expected: the science program increased science knowledge and the knowledge increase was not temporary. It is only when compared to a non-science program that the results are called into question. Since EK scores increased for the comparison group, the programming from SCIENCE cannot be said to have increased EK in participants.
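The reasoning in this section can be made concrete by comparing each group's gain from the pre-test, using the EK means reported in the results. The sketch below is descriptive arithmetic on those means only; it is not the ANOVA or the simple-effects tests used in the study, and the "difference in gains" is an illustrative quantity introduced here, not one the study reports. All names in the code are hypothetical.

```python
# EK means reported in the results (proportion of questions correct).
EK_MEANS = {
    "SCIENCE": {"pre": 0.43, "post": 0.69, "follow_up": 0.63},
    "Playgrounds": {"pre": 0.48, "post": 0.62, "follow_up": 0.73},
}


def gain(means, group, later_test):
    """Change in a group's mean score from the pre-test to a later test."""
    return means[group][later_test] - means[group]["pre"]


def difference_in_gains(means, program, comparison, later_test):
    """Program gain minus comparison gain; near zero means similar gains."""
    return gain(means, program, later_test) - gain(means, comparison, later_test)


for test in ("post", "follow_up"):
    program_gain = gain(EK_MEANS, "SCIENCE", test)
    comparison_gain = gain(EK_MEANS, "Playgrounds", test)
    diff = difference_in_gains(EK_MEANS, "SCIENCE", "Playgrounds", test)
    print(f"{test}: SCIENCE {program_gain:+.2f}, "
          f"Playgrounds {comparison_gain:+.2f}, difference {diff:+.2f}")
```

Looking at SCIENCE alone, both gains are positive (0.26 and 0.20). Set against the Playgrounds gains (0.14 and 0.25), the difference is about 0.12 at the post-test and about -0.05 at the follow-up, which is the pattern that prevents attributing the knowledge gain to the program. These differences say nothing about statistical significance.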

CONCLUSIONS AND REPERCUSSIONS FOR THE FIELD
ESF's summer program, SCIENCE, does expose Syracuse children to environmental science and scientists. However, the impact it had on increasing environmental literacy was not significant for attitude or knowledge. This study predicted no increase in EK or EA for the comparison group and, for participants involved in the experiential science program, an increase in EA and EK that would be sustained over time. Many of the results were not expected and did not follow the literature. Future studies should use the same framework to attempt to understand why the results were not as predicted, particularly the inclusion of a follow-up test to see if any gain from the program is sustained and a comparison group to uncover confounding variables.
Including a follow-up test uncovers the lasting impact of the environmental education intervention. Evaluating environmental literacy directly afterwards can show an immediate increase in environmental attitude or knowledge. Yet, to determine if this increase was sustained, an evaluation must be conducted after significant time has passed. If a participant was truly inspired by the program, they will be more likely to have retained information and sought out further knowledge. Finding the ideal passage of time between intervention and follow-up is not a simple endeavor. The longer the interval between the two, the more difficult it is to follow up with participants, which can result in too small a sample size. The length of the program is also a factor: a follow-up to an hour-long talk may not need to be delayed as long as the follow-up to a months-long camp. Further study into these factors would benefit future environmental education research. If a follow-up test had not been administered in this study, it could have been assumed the increase in SCIENCE participants' EA was sustained after the program ended.
This study also revealed how important a comparison group is to analyzing results. Without the data from Playgrounds, we would have inferred that SCIENCE increased participants' EK. Instead, we found EK increased for both groups, and therefore cannot attribute the knowledge gain to SCIENCE. It is uncommon for environmental education studies to include a comparison group. Lacking this key component in study design inhibits more complete data analysis from being conducted. Including a comparison group prevents incorrect conclusions from being reached and will allow our field to better analyze the impact of environmental education programming. We can recognize which interventions work with more accuracy. We will also be able to better identify confounding variables and discover how potentially to avoid them. Future studies should include a follow-up test and comparison group. These elements will ensure an improved evaluation of the environmental education program being studied.
Author contributions: All co-authors have been involved in all stages of this study while preparing the final version. They all agree with the results and conclusions.

Funding:
No external funding was received for this article.

Declaration of interest:
The authors declare that they have no competing interests.

Ethics approval and consent to participate: Not applicable.
Availability of data and materials: All data generated or analyzed during this study are available for sharing when an appropriate request is directed to the corresponding author.

1. If there are pollution tolerant macroinvertebrates in a stream that means it is polluted. True False
2. What are three major streams that flow into Onondaga Lake?
3. What is the process of soil or rock gradually wearing away called?
4. In scientific experimentation, what is a proposed explanation based on some evidence that needs further investigation called?
5. What percentage of the current water on Earth is from when the planet was first formed?
6. If a toxin biomagnifies, where in the food chain are higher concentrations found?
7. Does a balanced food web have more predator or prey species?
8. Name one characteristic of white pine that can be used for identification.
9. Name one example of green infrastructure.
10. What are the three categories of organisms that make up an ecosystem?