Predicting Penang Road Accidents Influences: Time Series Regression Versus Structural Time Series



Introduction
Rapid population growth and economic development in most developing countries have indirectly placed an additional burden on road safety. This progress has increased vehicle usage and, with it, the volume of traffic, which is the main contributor to road accidents. Malaysia is among the developing countries with the highest number of road fatalities per 100,000 population in ASEAN [1]. Penang, the third largest economic contributor among the states of Malaysia, is no exception to the increasing number of road accidents. Various initiatives have been set up to reduce these events, including a conspicuity programme, a road enhancement programme, a road behavioural change programme and an accident prevention and reduction programme [2]. However, the number of road accidents has kept increasing year by year. Therefore, this study aims to investigate the possible factors that influence road accident occurrence in the state of Penang, Malaysia, by comparing two statistical methods: time series regression and structural time series.
Several factors are related to road accident occurrence. The structure of the road, the vehicle, the environment and road user attitude are among the possible influences. Hence, several statistical and mathematical methods have been introduced to investigate them. The earliest study of road accidents is that of Smeed [3], which relates vehicle registrations and population to the number of road accidents. Later, Andreassen [4] found that Smeed's original formula cannot be applied universally to all countries. The approach has since been applied by many other researchers, such as Valli [5], who compared the Smeed and Andreassen equations, and Nasarudin et al. [6], who compared the performance of Smeed's method with linear regression.
Linear regression is the simplest model relating a dependent variable to independent variables, and it has given satisfactory performance in model development. Studies that use this prediction technique include Zlatoper [7], who related annual series of total motor vehicle deaths, occupant deaths and pedestrian deaths to a few explanatory variables. That study found that income, driving speed, vehicle size and type of driving were significantly related to motor vehicle deaths. Using a similar method, Desai and Patel [8] related hourly traffic volume to the hourly average of fatal accidents and total accidents. In the same way, Ali and Bakheit [9] compared the performance of linear regression with that of an artificial neural network and found that the neural network predictions were closer to the actual values than those of the regression model.
Generalized linear models are more suitable for count data. As the number of road accidents is count data, some studies prefer this approach. The earliest Malaysian road accident study, by Radin Umar et al. [10], applied Poisson regression to determine the effectiveness of the running headlight campaign. That study found that the intervention reduced conspicuity-related accidents by about 29%. More recently, Sarani et al. [11] used a similar method to determine the effectiveness of rear seat belt enforcement. They found that the rear seat belt intervention successfully reduced the number of people sustaining severe and slight injuries. Other researchers who have used similar methods include Greibe [12], Harnen et al. [13], Abusini et al. [14] and Abdul Manan et al. [15].
On the other hand, the approaches mentioned above may not be suitable for time series data. Therefore, Box-Jenkins time series models such as the Autoregressive (AR), Moving Average (MA), Autoregressive Moving Average (ARMA) and Autoregressive Integrated Moving Average (ARIMA) models have been preferred for modelling road accident occurrence, as in [16][17][18]. In addition, longitudinal or panel data analysis, which incorporates both cross-sectional and time series data, has been preferred in road accident studies such as [18][19][20][21][22]. Stationarity is commonly a compulsory condition in the application of time series analysis. However, stationarity may not always be easy to achieve in real series. Detrending and deseasonalizing may be a good way of achieving stationarity, but some important information may be lost in the process.
For that reason, structural time series models are preferable. Structural time series is another class of time series analysis that does not require stationarity and models all the time series components instead of removing them, as is done in Box-Jenkins analysis. Structural time series models have been used for modelling and forecasting in a variety of applications, such as financial and macroeconomic time series, and in many other areas, such as medicine, biology, engineering and marketing. The ability to model unobserved components, such as trend and seasonal, and to allow them to vary over time makes structural time series a suitable method for understanding the pattern of road accident occurrence and, at the same time, identifying the most influential factors contributing to it.
However, the application of structural time series in road safety is very rare, especially in Malaysia. The earliest application of structural time series in road safety can be found in [23]. Therefore, the effectiveness of this approach is assessed by comparing it with time series regression, another model preferred by researchers. The rest of this paper discusses the analysis approach used in this study, followed by the estimation procedure, the results and finally the discussion and conclusion.

Methodology
This section gives a general description of the methodology and of the data used in this study.

Time Series Regression
Regression is normally used to investigate the relationship between variables. The variables used in regression analysis can be categorized as dependent and independent variables. The dependent variable is the variable of interest, which is influenced by the independent variables. Although regression was developed for cross-sectional data, the technique has also been employed for analysis involving time series variables. The general form of the time series regression involving k variables can be written as in Equation (1).
$$y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \cdots + \beta_k x_{kt} + \varepsilon_t \tag{1}$$
where $y_t$ is the variable of interest, $x_{1t}, x_{2t}, \ldots, x_{kt}$ are the possible factors that might affect the dependent variable $y_t$, $\beta_0, \beta_1, \ldots, \beta_k$ are the regression coefficients and $\varepsilon_t$ is an error term. The regression model may take two forms: a simple regression, which involves only one independent variable, and a multiple regression, which involves more than one independent variable. The regression model may also include lags of the dependent and independent variables.
The parameters of the model are estimated using the Ordinary Least Squares (OLS) method, and a few assumptions pertaining to the model and the error term are examined. The Durbin-Watson (DW) test statistic is used to examine serial correlation in the error term, and a lag of the dependent or an independent variable is included in the model as a remedy for serial correlation. Multicollinearity is tested using the Variance Inflation Factor (VIF); multicollinearity is said to exist if the VIF is greater than 10. For a complete discussion of the time series regression technique, refer to [24,25].
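The estimation and checks described above can be sketched in a few lines. The example below is only an illustration on synthetic data, not the study's data or code: it assumes NumPy is available, the function names `ols`, `durbin_watson` and `vif` are chosen here for convenience, and the VIF of regressor j is computed as 1 / (1 - R_j^2) from an auxiliary regression of that regressor on the remaining ones.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients and residuals; X must already contain a constant column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def durbin_watson(resid):
    """DW = sum of squared first differences of residuals / sum of squared residuals."""
    diff = np.diff(resid)
    return float(diff @ diff / (resid @ resid))

def vif(X_no_const):
    """VIF of each regressor from an auxiliary regression on the other regressors."""
    n, k = X_no_const.shape
    factors = []
    for j in range(k):
        target = X_no_const[:, j]
        others = np.column_stack([np.ones(n), np.delete(X_no_const, j, axis=1)])
        _, aux_resid = ols(others, target)
        r2 = 1.0 - aux_resid @ aux_resid / np.sum((target - target.mean()) ** 2)
        factors.append(float(1.0 / (1.0 - r2)))
    return factors

# Synthetic illustration (assumed data): 156 monthly points, two regressors.
rng = np.random.default_rng(0)
n = 156
x = rng.normal(size=(n, 2))
y = 2.0 + 3.0 * x[:, 0] - 1.0 * x[:, 1] + rng.normal(size=n)

beta, resid = ols(np.column_stack([np.ones(n), x]), y)
dw = durbin_watson(resid)   # values near 2 indicate little serial correlation
vifs = vif(x)               # values above 10 would flag multicollinearity
```

In practice the DW value is compared with tabulated lower and upper bounds for the given sample size and number of regressors, as is done with the results reported later.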

Structural Time Series
Traditionally, time series data can be represented as the sum of trend, seasonal and irregular components.
$$y_t = \mu_t + \gamma_t + \varepsilon_t \tag{2}$$
where $y_t$ denotes the observation of interest, $\mu_t$ is a linear deterministic trend consisting of level and slope components, and $\gamma_t$ and $\varepsilon_t$ are the seasonal and irregular components, which are also often treated deterministically. This formulation has a limitation, since the model does not allow the components to evolve over time [23]. The structural time series model allows the components of the time series process to vary over time. The structural time series approach uses a linear Gaussian model, often referred to as a state space model. The model is based on a state space form that relates an observed variable to unobserved components representing the various time series components, such as level, slope and seasonal. The observed variable $y_t$ is related to the unobserved component $\alpha_t$ through the measurement equation
$$y_t = Z_t \alpha_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, H_t) \tag{3}$$
Meanwhile, the unobserved component follows a first-order autoregressive process given by the state equation
$$\alpha_{t+1} = T_t \alpha_t + R_t \eta_t, \qquad \eta_t \sim N(0, Q_t) \tag{4}$$
where $\varepsilon_t$ and $\eta_t$ are vectors of disturbances and $Z_t$, $T_t$, $R_t$, $H_t$ and $Q_t$ are system matrices that need to be estimated and are usually treated as constant. The $(n \times 1)$ observation vector $y_t$ consists of $n$ observations and $\alpha_t$ is the $(m \times 1)$ unobserved state vector. The $(n \times 1)$ irregular vector $\varepsilon_t$ has zero mean and $(n \times n)$ covariance matrix $H_t$. The $(n \times m)$ matrix $Z_t$ links the observed vector $y_t$ to the unobservable state vector $\alpha_t$, and the measurement equation may also contain regression variables. The $(m \times m)$ transition matrix $T_t$ determines the dynamic evolution of the state vector. The state disturbance selection matrix $R_t$ is an $(m \times r)$ matrix, while the $(r \times 1)$ disturbance vector $\eta_t$ has zero mean and covariance matrix $Q_t$. The observation and state disturbances $\varepsilon_t$ and $\eta_t$ are assumed to be uncorrelated with each other at all time periods. By appropriate choices of the vector $\alpha_t$ and the matrices $Z_t$ and $T_t$, a wide range of time series models can be written in the state space form of (3) and (4).
The time series model in (2) can be written in state space form to allow each of its components to evolve. The structural time series models with local level with drift (deterministic slope) and with local linear trend with seasonal (of period $s$, say) are written as (5) and (6), where $\nu_t$ denotes the slope and $\xi_t$, $\zeta_t$ and $\omega_t$ denote the level, slope and seasonal disturbances.
$$y_t = \mu_t + \varepsilon_t, \qquad \mu_{t+1} = \mu_t + \nu_t + \xi_t, \qquad \nu_{t+1} = \nu_t \tag{5}$$
$$y_t = \mu_t + \gamma_t + \varepsilon_t, \qquad \mu_{t+1} = \mu_t + \nu_t + \xi_t, \qquad \nu_{t+1} = \nu_t + \zeta_t, \qquad \gamma_{t+1} = -\sum_{j=1}^{s-1} \gamma_{t+1-j} + \omega_t \tag{6}$$
To investigate the effect of the selected independent variables, explanatory and intervention variables can be added to the measurement equation of the model. The effect of the selected independent variables $x_{1t}, x_{2t}, \ldots, x_{kt}$ can be investigated based on the following measurement equation.
$$y_t = \mu_t + \gamma_t + \sum_{j=1}^{k} \beta_j x_{jt} + \varepsilon_t \tag{7}$$
The structural time series models in (5), (6) and (7) can be estimated using the iterative Kalman filter procedure. Details of the Kalman filter estimation procedure can be obtained from [26]. A stepwise approach is used to develop the structural time series model of road accidents. The best structural time series model is the one with the smallest Akaike Information Criterion (AIC).
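To make the Kalman filter recursion concrete, the sketch below filters the simplest structural model, the local level model, for given disturbance variances. It is an illustration under stated assumptions rather than the estimation code used in the study: the function names, the large initial variance `p1` used as an approximate diffuse prior, the omission of the first observation from the log-likelihood and the textbook form AIC = -2 log L + 2 (number of parameters) are all choices made here, and the AIC reported by a given software package may be scaled differently.

```python
import math

def local_level_filter(y, var_eps, var_xi, a1=0.0, p1=1e7):
    """Kalman filter for y_t = mu_t + eps_t, mu_{t+1} = mu_t + xi_t.

    Returns the filtered level and the Gaussian log-likelihood, skipping the
    first observation because the initial level is treated as diffuse.
    """
    a, p = a1, p1                      # predicted state mean and variance
    filtered, loglik = [], 0.0
    for t, obs in enumerate(y):
        v = obs - a                    # one-step-ahead prediction error
        f = p + var_eps                # prediction error variance
        k = p / f                      # Kalman gain
        if t > 0:
            loglik += -0.5 * (math.log(2 * math.pi) + math.log(f) + v * v / f)
        a_filt = a + k * v             # filtered level given y_1..y_t
        p_filt = p * (1.0 - k)
        filtered.append(a_filt)
        a, p = a_filt, p_filt + var_xi # prediction for the next period
    return filtered, loglik

def aic(loglik, n_params):
    """Textbook AIC; smaller values indicate the preferred model."""
    return -2.0 * loglik + 2.0 * n_params

# Toy illustration (assumed data): a constant series observed 20 times.
filtered, loglik = local_level_filter([5.0] * 20, var_eps=1.0, var_xi=0.1)
model_aic = aic(loglik, n_params=2)
```

In a full estimation the two variances would be chosen by maximizing `loglik` numerically, and the same recursion generalizes to the matrix form of the measurement and state equations.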
As in other regression and time series analyses, it is customary to check that the estimated residuals fulfil certain assumptions. The three assumptions for the disturbances $\varepsilon_t$ are independence, homoscedasticity and normality, which are diagnosed using the Ljung-Box (LB), Goldfeld-Quandt (GQ) and Jarque-Bera (JB) tests respectively. For comparison purposes, the same tests were applied to the time series regression.
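Two of these diagnostics are simple enough to compute by hand. The sketch below evaluates the Ljung-Box statistic Q = n(n+2) Σ r_k² / (n-k) and the Jarque-Bera statistic JB = (n/6)(S² + (K-3)²/4) for a residual series. The function names and the toy residuals are assumptions made for illustration; p-values, which require the chi-square distribution, are left out.

```python
def autocorr(resid, lag):
    """Sample autocorrelation of the residuals at the given lag."""
    n = len(resid)
    mean = sum(resid) / n
    dev = [r - mean for r in resid]
    denom = sum(d * d for d in dev)
    num = sum(dev[t] * dev[t - lag] for t in range(lag, n))
    return num / denom

def ljung_box(resid, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag (large values suggest dependence)."""
    n = len(resid)
    return n * (n + 2) * sum(
        autocorr(resid, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )

def jarque_bera(resid):
    """Jarque-Bera statistic from sample skewness S and kurtosis K."""
    n = len(resid)
    mean = sum(resid) / n
    m2 = sum((r - mean) ** 2 for r in resid) / n
    m3 = sum((r - mean) ** 3 for r in resid) / n
    m4 = sum((r - mean) ** 4 for r in resid) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

# Toy residuals (assumed): a strictly alternating series is strongly dependent.
demo_resid = [1.0, -1.0] * 4
lb = ljung_box(demo_resid, 1)
jb = jarque_bera(demo_resid)
```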

Prediction Accuracy
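Prediction accuracy for both models is evaluated using three widely used loss functions: Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE). The equations for each measure are given in Equation (8).
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right|, \qquad \mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right| \tag{8}$$
where $y_t$ and $\hat{y}_t$ are the actual observed value and the predicted value respectively, and $n$ is the number of predicted values.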
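The three measures are straightforward to compute once actual and predicted values are available. The following is a minimal sketch with assumed function names and made-up numbers, with MAPE expressed as a percentage; it is not the code behind the reported results.

```python
import math

def rmse(actual, predicted):
    """Root mean square error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mae(actual, predicted):
    """Mean absolute error."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n

def mape(actual, predicted):
    """Mean absolute percentage error; requires non-zero actual values."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - p) / a) for a, p in zip(actual, predicted))

# Made-up values for illustration only.
actual = [100.0, 200.0]
predicted = [110.0, 190.0]
scores = (rmse(actual, predicted), mae(actual, predicted), mape(actual, predicted))
```

RMSE penalizes large errors more heavily than MAE, while MAPE is scale-free, which is why the three are usually reported together.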

Data Considered
The data considered in this study comprise the monthly number of road accidents (RDAC) from January 2001 until December 2013, obtained from the Royal Malaysia Police. A road accident is defined as an occurrence on a public or private road due to the negligence or omission of any party concerned (in terms of road user conduct, vehicle maintenance and road condition) or due to environmental factors (excluding natural disasters), resulting in a collision (including out-of-control cases and collisions of a victim in a vehicle with an object inside or outside the vehicle, e.g. a bus passenger) which involves at least one moving vehicle, structure or animal and is recorded by the police [27]. The monthly number of road accidents is the dependent variable in this study.
The independent variables in this study are climatic variables, oil price, law enforcement and school holidays. Data for the climatic variables were obtained from the Malaysian Meteorological Department and the Department of Environment. The climatic factors considered are the monthly average amount of rainfall (in millimetres) (MRAIN), the number of rainy days (DRAIN), the monthly average temperature (in degrees Celsius) (TEMP) and the air pollution index (API) in Penang. Climatic variables have been commonly used in previous road safety modelling work [16,20,22,28,29]. The data for the crude oil price (in Malaysian Ringgit per barrel) (OILP) were accessed from the World Bank website. The price is the simple average of three spot prices: Dated Brent, West Texas Intermediate and Dubai Fateh.
The other data considered relate to road safety enforcement and new traffic legislation, namely the enforcement of the rear seat belt law (BELT) and Ops Sikap (OPSKP). The compulsory rear seat belt legislation has been enforced since January 2009. Ops Sikap, or Ops Attitude, is a traffic safety operation carried out by the Royal Malaysia Police to nurture safety awareness on all roads in Malaysia during festive seasons. It began in December 2001 and operates only during the main festival seasons, such as Eid ul-Fitr, Deepavali and Chinese New Year. Both variables are represented as dummy variables, where "0" represents before enforcement and "1" represents after enforcement.
Another variable included in this study is school holiday (SCHOL), one of the calendar effect variables. The SCHOL variable is included to examine whether school holidays have an impact on road accident occurrence. Utusan Malaysia (2012) claimed that road fatalities increase to 18-20 per day during school holidays, compared with no more than 15 deaths per day on school days [30]. In contrast, [28] claimed that traffic volume in Australia during school holidays falls by 3.4% of the mean daily volume compared with school days. As with the law and legislation variables, SCHOL is represented by a dummy variable that takes the value "1" if a school holiday occurs and "0" otherwise.
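The dummy coding described above can be sketched as follows. The study period (156 months from January 2001) and the start of rear seat belt enforcement (January 2009) are taken from the text; the helper names and the two months passed to `pulse_dummy` are hypothetical placeholders, since the actual Ops Sikap and school holiday months are not listed here.

```python
from datetime import date

def month_range(start_year, start_month, n_months):
    """List of first-of-month dates covering n_months consecutive months."""
    months, year, month = [], start_year, start_month
    for _ in range(n_months):
        months.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months

def step_dummy(months, start):
    """0 before the start month, 1 from the start month onwards (e.g. BELT)."""
    return [1 if d >= start else 0 for d in months]

def pulse_dummy(months, active):
    """1 only in the listed (year, month) pairs, 0 elsewhere (e.g. SCHOL)."""
    return [1 if (d.year, d.month) in active else 0 for d in months]

months = month_range(2001, 1, 156)               # January 2001 .. December 2013
belt = step_dummy(months, date(2009, 1, 1))      # rear seat belt law

# Hypothetical placeholder months, for illustration only.
example_active = {(2001, 12), (2002, 2)}
example_pulse = pulse_dummy(months, example_active)
```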

Model Estimation Results
This section discusses the results of model estimation using both the time series regression and the structural time series model. Recall that in this study the dependent variable $y_t$ is the number of road accidents (RDAC) in the state of Penang, while the independent variables are average rainfall (MRAIN), number of rainy days (DRAIN), average temperature (TEMP), air pollution index (API), crude oil price (OILP), enforcement of the rear seat belt law (BELT), operation of Ops Sikap (OPSKP) and school holiday (SCHOL).

Road Accidents Model with Time Series Regression
The results of estimating the time series regression model by OLS are given in Table 1. Trend and seasonal variables are included to account for time. The trend variable is the time index, t = 1, …, 156, and the seasonal variables are monthly dummies with month 12 as the reference. Table 1 shows that the estimated model fits the data fairly well, with an $R^2$ of 93.5%. Multicollinearity does not exist, since the VIF values are less than 10 except for the trend variable. However, the DW statistic of 1.467 indicates serial correlation in the estimated residuals, as the value is below the lower limit of 1.55. The lag of the dependent variable is therefore included in the model as a remedy for the serial correlation.
The results for the new model with the lagged dependent variable are given in Table 2. The trend variable was excluded because it showed multicollinearity. Table 2 shows that the serial correlation found in the previous model was removed by the inclusion of the lagged dependent variable. The value of $R^2$ shows that approximately 91% of the total variation in road accident occurrence is explained by the selected variables. The estimated model shows that an increase in rainfall, the number of rainy days, temperature and the oil price results in an increase in the number of road accidents. The finding that the amount of rainfall increases the number of [...] In addition, and surprisingly, the estimated model suggests that the number of road accidents also increases after the enforcement of the rear seat belt law and while Ops Sikap is in operation during the festival period. Meanwhile, the model indicates that the number of road accidents decreases with increases in API and during school holidays. Table 2 shows that only the enforcement of the rear seat belt law and the increase in oil price have a significant effect on the number of road accidents in Penang.
Figure 1 shows the fitted values obtained from the time series regression together with the actual number of road accidents. In general, the fitted values move very similarly to the observed values. Figure 2 shows the residual plot of the time series regression model. The plot indicates that there is serial correlation in the residuals.
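The regressors described in this section (time trend, eleven monthly dummies with month 12 as the reference, and the lagged dependent variable) can be assembled as in the sketch below. The function name, the `include_trend` switch and the placeholder series are assumptions for illustration; the series is assumed to start in January, and the first observation is dropped because its lag is not available.

```python
def design_rows(y, include_trend=True):
    """Build rows [trend, D1..D11, y_{t-1}] and targets y_t for t = 2..n.

    Month 12 is the reference category, so a December row has all dummies 0.
    """
    rows, target = [], []
    for i in range(1, len(y)):
        month = i % 12 + 1                      # i = 0 corresponds to January
        dummies = [1 if month == m else 0 for m in range(1, 12)]
        row = ([i + 1] if include_trend else []) + dummies + [y[i - 1]]
        rows.append(row)
        target.append(y[i])
    return rows, target

# Placeholder series of 156 values (not the accident data).
series = list(range(1, 157))
rows_with_trend, target = design_rows(series)
rows_no_trend, _ = design_rows(series, include_trend=False)
```

Dropping the trend column, as in `rows_no_trend`, corresponds to the specification that excludes the trend because of multicollinearity.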

Road Accidents Model with Structural Time Series
The structural time series models are developed using a stepwise procedure. The analysis begins with a simple univariate local level model, followed by a local linear trend model, and finally a seasonal component is added to both models. All the models were estimated without the independent and lagged dependent variables. The disturbance variances are estimated by maximum likelihood, while the state components are estimated using the Kalman filter. Table 3 presents the estimated disturbance variances and the main diagnostic tests for each structural time series model. Referring to Table 3, the first model developed, the local level model, has estimated residuals that do not fulfil the assumptions of independence and equal variance. Adding the slope component, which yields the local linear trend model, shows no improvement. The AIC value of this model is higher than that of the local level model, and the independence assumption for the estimated residuals is again not fulfilled. Since the slope component does not improve the model, a local level model with seasonal was developed. This model allows the level and seasonal components to vary over time while excluding the slope component. It shows some improvement, as the estimated residuals fulfil all the structural time series residual assumptions. However, the seasonal disturbance variance is very small, which indicates that the seasonal component rarely changes over time. The same model was therefore re-estimated with the seasonal disturbance fixed. The results show that the model improved: its AIC value is among the lowest and the estimated residuals satisfy all the residual assumptions. To confirm that the local level model with fixed seasonal is the best model to represent Penang road accidents, the slope component was incorporated once again.
Although the estimated residuals satisfy all the assumptions, adding the slope component does not show any improvement, as the AIC value increases. Therefore, the best model to represent the monthly number of road accidents is the local level model with deterministic seasonal. This model allows the level component to vary over time while the seasonal component is fixed. Figure 3 shows that the level of road accidents increases over time, with the true observations lying above and below the estimated level. Owing to the deterministic assumption for the seasonal component, Figure 4 shows a seasonal pattern that repeats unchanged from year to year.
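For reference, the selected specification can be written out explicitly. The following is a sketch in standard local level notation with a dummy-variable seasonal of period 12 for monthly data; the symbols $\xi_t$, $\sigma^2_\varepsilon$ and $\sigma^2_\xi$ are assumed here for the level disturbance and the two variances, and the deterministic seasonal corresponds to a seasonal disturbance variance of zero.

```latex
\begin{aligned}
y_t &= \mu_t + \gamma_t + \varepsilon_t, & \varepsilon_t &\sim N(0, \sigma^2_\varepsilon),\\
\mu_{t+1} &= \mu_t + \xi_t, & \xi_t &\sim N(0, \sigma^2_\xi),\\
\gamma_{t+1} &= -\sum_{j=1}^{11} \gamma_{t+1-j}.
\end{aligned}
```

Because the last equation has no disturbance term, the twelve seasonal effects sum to zero within every year and do not change over the sample.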

Adding Explanatory and Intervention Variable
Since the aim of this study is to investigate the most influential factors that contribute to road accidents, a few explanatory and intervention variables are added to the univariate model above. The best univariate model to represent the monthly number of road accidents is the local level model with deterministic seasonal. The estimates obtained after adding the explanatory and intervention variables are summarized in Table 4. The lag of the dependent variable is also added to make the model comparable with the time series regression model. Table 4 shows the estimated structural time series model for the number of road accidents when accounting for the eight selected explanatory variables. For comparison purposes, the estimated coefficients of the time series regression model are reproduced. As in the time series regression model, the structural time series model shows negative effects of an increase in API and of school holidays, and positive effects of an increase in rainfall, the number of rainy days and temperature and of the OPSKP operation on the number of road accidents. Unlike the estimated regression model, the estimated structural time series model shows a negative effect of an increase in oil price and of the enforcement of the rear seat belt law. The structural time series model thus better reflects the effect of the rear seat belt law, whose enforcement reduces the number of road accidents in Malaysia. These two variables are also found to be significant at the 10% level in the time series regression. In the structural time series model, in contrast, road accidents are significantly related to increasing temperature and to the OPSKP operation. The result that an increase in temperature may increase road accidents by 1% agrees with [31], which states that road accidents are more likely to occur at high temperatures.
Driving performance at high temperatures can worsen because of the psychological and physiological effects of ambient temperature. The increase in road accidents during OPSKP is expected, since OPSKP is only implemented during festival seasons, when traffic volume becomes much higher as Malaysians take the opportunity to return to their hometowns, which contributes to a higher number of road accidents.
This result shows that the main factors behind road accidents are related to traffic volume and temperature. Traffic volume can be reduced by greater campaigning for car pooling and by improving public transport, particularly buses and trains. Greater encouragement of car pooling not only reduces fuel costs and air pollution, but also reduces the stress of driving alone, which may lead to road accidents. Improvements in public transport, such as increasing the frequency and accessibility of local and school buses as well as inter-state train journeys, would help to reduce traffic volume and road congestion. The temperature effect cannot be controlled, but preventive measures for periods of adverse temperature should be introduced.
Although the value of $R^2$ from the structural time series model is smaller than the corresponding value from the time series regression model, the time series regression model does not fulfil the independence assumption for the residuals. The advantage of the structural time series model can also be seen in Figure 5, where the fitted values are closer to the observed values than those in Figure 1 for the time series regression model. Similarly, the residual plot of the structural time series model in Figure 6 shows residuals that are much closer to independent random values than those in Figure 2. This is confirmed by the independence test on the residuals reported in Table 4.

Prediction Accuracy
Recall that in this study two statistical models were used to model Penang road accidents. To identify the better fitting of the two models, the loss measures RMSE, MAE and MAPE were used. Twelve in-sample observations, from January 2013 until December 2013, were used as a benchmark to test the prediction accuracy. The results are tabulated in Table 5. They indicate that the two models do not differ much, but that the prediction of the structural time series model is superior to that of the time series regression. This can be seen from the loss functions, which show that the structural time series model gives smaller errors than the time series regression. The predictions are illustrated in Figure 7, which clearly shows that the structural time series estimates are closer to the actual values.

Conclusion
This study compares two time series methodologies, time series regression and structural time series, in modelling the number of road accidents in Penang. In addition, the study aims to investigate the factors that affect the number of road accidents. The study found that the two models differ in terms of the significant factors affecting the number of road accidents. The structural time series model is found not only to fit the number of road accidents better but also to give better predictions, with all the residual assumptions satisfied. Not surprisingly, the structural time series is the better model, since it allows the level of the number of road accidents to vary over time. The structural time series model shows that temperature and the operation of Ops Sikap are significant factors influencing the number of road accidents in Penang. The enforcement of the rear seat belt law also helps to reduce the number of road accidents.
This study has a few limitations. It was developed only for the state of Penang and may not reflect the road accident scenario in other states in Malaysia. In addition, the model may not accurately represent the number of road accidents, as the data cover only reported cases, and there may be further unreported accidents. Further investigation may include other relevant variables such as traffic volume and economic factors.