There exists a large number of works utilizing Machine Learning (ML) and Deep Learning (DL) approaches for the analysis and forecasting of time series data. Autoregressive Integrated Moving Average (ARIMA), Support Vector Regression (SVR) and Random Forest (RF) are some common ML techniques popular in this field. In the DL domain, multiple works have used Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). Other DL techniques, like Deep Neural Networks (DNN) and Convolutional Neural Networks (CNN), are also popular.
Despite the presence of multiple works in the same domain, some research gaps remain. Linear models like ARIMA perform well in capturing the linear relations in a data set, but they are quite inefficient in handling the nonlinear relationships hidden in historical time series data. Some recent works have tried to take advantage of DL-based models in the field of time series analysis. However, DL techniques work better on larger datasets with a large number of features.
Further, it has been observed that the prediction of closing and opening prices of stocks has been the key area of interest for most of the research done in this field. Practically, however, predicting the next day's Open, High and Low prices helps a trader plan an intraday trade more effectively. Thus, there is a need to set up a framework that takes advantage of ML and Ensemble techniques and can predict the said parameters with significant accuracy, minimizing the risks and maximizing the profits of an intraday trader.
In the current work, an amalgamated ML approach named LinVec (derived from the terms Multiple Linear Regression and Support Vector Machine), capable of capturing both the linear and nonlinear relationships present in the dataset, is proposed. The major motivation behind such an Ensemble technique is that it improves the performance of the model when compared to its individual counterparts. In the proposed Ensemble method, two popular ML techniques, SVR and Multiple Linear Regression (MLR), are used as base models. Both base models are trained on the same data from the training set. Each model learns differently and produces a separate set of outputs. The results obtained from the base models, along with the corresponding real target values, are then fed to the meta-model, which is another MLR model with some necessary modifications. The meta-model is trained on the newly created data set using k-fold cross-validation and thereby becomes capable of combining the predictions from the two base models to improve the results further. Across more than a thousand test cases, it has been observed that the present amalgamated technique produces better results than its individual constituent parts.
The proposed model is capable of handling any time series data. The model analyses the dataset and tries to find the hidden time-sensitive relationships among the features and the targets. The model may be trained on the most recent data available to forecast the future values of certain targets. To demonstrate the capability of the proposed framework, it is tested in the field of stock price prediction. The model can be trained on the historical price movement data available until the end of the present day's market session to forecast the next day's price movements. Feature selection plays an important role in the preparation of an ML model capable of analyzing time series data. The price movement features of an individual stock, like daily open, close, high and low prices, delivery percentage, etc., and some features from the key as well as sectoral indices are taken into consideration. By analyzing all these features, the proposed model predicts the next day's price movement of an individual stock. The optimum values of the evaluation metrics justify the accuracy of the predictions and the efficiency of the proposed model. The outcomes of the proposed method are compared with similar work done by Henrique et al.
The remaining part of the paper is organized as follows. Section 2 deals with the preliminaries. Section 3 describes the methodology. Section 4 presents the empirical results of the approach and the comparison of the results with a previous work done in the same field. Finally, the conclusions and future directions of the research are discussed in Section 5.
The proposed model is designed to handle all types of data that are sensitive to time. In this work, stock price movement, one of the most popular time series problems, is taken into consideration. Stocks of six companies listed on the Indian stock exchanges are studied. We have chosen two companies from each of the Large Cap, Mid Cap and Small Cap categories, resulting in six in total. Large-cap companies have well-established businesses and a significant market share, with market capitalisations of INR 20,000 crore or more. Mid-cap companies have market capitalisations above INR 5,000 crore but below approximately INR 20,000 crore, whereas companies whose market capitalisation is less than INR 5,000 crore are generally put in the Small Cap category. The National Stock Exchange of India (NSE) has many indices that provide information about the price movements of stocks and different forms of investments for the companies listed on Indian stock exchanges. Stock market indices are meant to capture the overall behaviour of equity markets.
Another area of interest while considering stock market data is the sectoral indices. The sectoral index data provides an idea of how a particular sector is performing. If all the related stocks of a sector are performing well, it signifies that the demand for that sector is increasing. It is then very likely that the stock of our interest from the same sector will also grow. NSE has multiple sectoral indices.







| Company | Key Index | Sectoral Index | Start Date | End Date | No. of Samples |
|---|---|---|---|---|---|
| Reliance Industries Ltd. (RIL) | NIFTY 50 | Nifty Oil & Gas | 12-Feb-2020 | 09-Feb-2022 | 496 |
| Ashok Leyland Ltd. (ASHOKLEY) | NIFTY MIDCAP 50 | Nifty Auto | 01-Jan-2018 | 30-Dec-2019 | 490 |
| Birlasoft Ltd. (BSOFT) | NIFTY SMALLCAP 50 | None (NIFTY IT to be considered) | 01-Jan-2020 | 30-Dec-2021 | 499 |
| Sun Pharmaceutical Industries Ltd. (SUNPHARMA) | NIFTY 50 | Nifty Pharma | 01-Apr-2020 | 05-Apr-2022 | 500 |
| AU Small Finance Bank Ltd. (AUBANK) | NIFTY MIDCAP 50 | Nifty Bank | 01-Jan-2020 | 31-Dec-2021 | 500 |
| Sobha Ltd. (SOBHA) | NIFTY Smallcap 250 | Nifty Realty | 04-May-2020 | 13-May-2022 | 507 |
For the present study, approximately a hundred features from individual stocks and the sectoral and key indices have been considered and tested on the model. After rigorous testing, only twelve features, which provide optimized RMSE, MAPE and R^{2} values, are selected. Ten of these 12 features are taken from the properties of the individual stock: Daily Open Price, High Price, Low Price, Last Price, Close Price, Average Price, Total Traded Quantity, No. of Trades, Deliverable Quantity and Delivery Percentage. The other two features are taken from the indices: the Daily Open Value of the Key Index and the Daily Close Value of the Sectoral Index.
The data set has three target variables: the next day's Open, High and Low prices. The features and the targets of the data set are tabulated using the following rules. The next day's open, high and low prices of a stock are considered as the targets for the present day. During the learning phase, the model analyses all the features of the present day together with the next day's open, high and low prices as the real target values, and tries to find the relation between the targets and the features. During the test phase, the model takes only the present-day data and tries to predict the next day's real price values of that stock.
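The feature/target alignment described above can be sketched with pandas; the column names and toy values here are illustrative assumptions, not the authors' actual schema.

```python
import pandas as pd

# Toy illustration: each present-day row is paired with the NEXT day's
# Open/High/Low prices as its targets (column names are assumptions).
df = pd.DataFrame({
    "Open": [100.0, 102.0, 101.5, 103.0],
    "High": [101.0, 103.5, 102.0, 104.0],
    "Low":  [99.0, 101.0, 100.5, 102.0],
})

targets = df[["Open", "High", "Low"]].shift(-1)     # next day's prices
targets.columns = ["NextOpen", "NextHigh", "NextLow"]
data = pd.concat([df, targets], axis=1).dropna()    # last row has no "next day"

print(data)
```

The `shift(-1)` call moves each price column up by one row, so the last trading day (which has no following day) is dropped before training.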
The prepared data set has different features with different numerical ranges. For example, the stock values are in the range of hundreds or thousands (INR), whereas the number of trades and the deliverable quantity values are in the range of lakhs or crores. Due to this heterogeneity, some features may dominate others during the training procedure. To avoid this problem, all the features need to be brought down to a common scale. For this purpose, data standardization is performed on all of the features based on the following formula:

z = (x − u) / s    (1)

where z is the standard score of a sample x, u represents the mean of the training samples, and s represents the standard deviation of the training samples.
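A minimal sketch of this standardization, assuming a small toy feature matrix; the mean u and standard deviation s are computed on the training samples only and then reused for unseen data.

```python
import numpy as np

# Z-score standardization sketch: u and s come from the TRAINING samples,
# and the same transform is later applied to the test split.
train = np.array([[100.0, 2_500_000.0],
                  [102.0, 3_100_000.0],
                  [101.0, 2_800_000.0]])   # e.g. price (INR) vs traded quantity

u = train.mean(axis=0)      # mean of the training samples, per feature
s = train.std(axis=0)       # standard deviation of the training samples
z = (train - u) / s         # all features brought to a common scale

print(z.mean(axis=0), z.std(axis=0))   # approximately [0, 0] and [1, 1]
```

scikit-learn's `StandardScaler` performs this same transform and stores u and s so the identical scaling can be reused on the test split.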
To evaluate the price prediction performance of the proposed model, we used three evaluation metrics: Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE) and the R^{2} (coefficient of determination) regression score.
The RMSE, MAPE and R^{2} score can be calculated according to Eqns. 2, 3 and 4, respectively:

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (x_{i} − x̂_{i})^{2} )    (2)

where RMSE = Root Mean Square Error, N = number of non-missing data points, x_{i} = i-th real sample value and x̂_{i} = i-th predicted value.

MAPE = (1/n) Σ_{t=1}^{n} |(A_{t} − F_{t}) / A_{t}|    (3)

where MAPE = Mean Absolute Percentage Error, n = number of non-missing samples, A_{t} = actual value and F_{t} = forecast value.

R^{2} = 1 − SS_{res} / SS_{tot}    (4)

where SS_{res} = sum of squares of the residual errors and SS_{tot} = total sum of squares.
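The three metrics can be implemented directly from these definitions; the helpers below are a sketch (MAPE is returned as a fraction rather than a percentage, matching the magnitude of the values reported in the results).

```python
import numpy as np

def rmse(actual, forecast):
    """Root Mean Squared Error (Eq. 2)."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

def mape(actual, forecast):
    """Mean Absolute Percentage Error (Eq. 3), returned as a fraction."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.mean(np.abs((actual - forecast) / actual)))

def r2_score(actual, forecast):
    """Coefficient of determination (Eq. 4): 1 - SS_res / SS_tot."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    ss_res = np.sum((actual - forecast) ** 2)        # residual sum of squares
    ss_tot = np.sum((actual - actual.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)
```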
We propose a hybrid model based on the Stacked Ensemble machine learning technique, which implements the concept of Stacked Generalization. The implementation utilizes Support Vector Regression (SVR) and Multiple Linear Regression (MLR) as the base models and another MLR as the meta-model. The base and meta-models are trained on the same training set. The two base models are trained differently and learn distinct properties of the same dataset. The meta-model is utilized to combine the predictive capabilities of both base models. The final predictions are obtained from the meta-model and are more accurate than the predictions of the individual base models.
Stacked Generalization, or Stacking, is an ensemble machine learning algorithm in which a meta-model learns how to combine the outputs produced by the base models efficiently, so that the final model outperforms the individual base models. This technique uses two or more heterogeneous base models/estimators fit on the same training data. These base models are trained on the same data but capture different aspects of it. The problem then becomes which of the base models to trust. The solution is to use another machine learning model, called the meta-model/final estimator, that learns which base model to trust more, and when. First, all the base models are trained on the same training data. The predictions from the base models become the features, and the expected outputs become the targets, of the training data set for the meta-model. The base estimators are trained on the whole training data, but the final estimator is generally trained using the K-fold cross-validation technique. In this technique, the whole data set is split into K consecutive non-overlapping groups/folds. Each fold is then used once as a validation set while the remaining K−1 groups are used as the training data set.
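The stacking procedure described above can be sketched with scikit-learn; the data below is synthetic and purely illustrative, standing in for the study's twelve stock/index features.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))              # stand-in for the 12 features
y = X @ rng.normal(size=12) + 0.1 * rng.normal(size=200)

base_svr, base_mlr = SVR(kernel="linear"), LinearRegression()

# Out-of-fold predictions from each base model become the meta-model's
# features, so the meta-learner never sees predictions made on a fold
# that the base model was trained on.
p_svr = cross_val_predict(base_svr, X, y, cv=5)
p_mlr = cross_val_predict(base_mlr, X, y, cv=5)
meta_X = np.column_stack([p_svr, p_mlr])

# The meta-model learns how much weight to give each base model.
meta = LinearRegression().fit(meta_X, y)

# For inference, the base models are refit on the full training data;
# a new sample passes through both, and the meta-model combines them.
base_svr.fit(X, y)
base_mlr.fit(X, y)
```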
Support Vector Machine (SVM) is popularly and widely used for classification problems in machine learning. A classification method based on SVM maps the independent variables of the N available samples into a higher-dimensional space and is typically used to classify observations between groups. This method uses a separating hyperplane with the maximum margin between the classes as the decision boundary.
In our study, the objective is to predict the future prices of stocks. Thus, the goal is not to classify the results into groups, but rather to estimate real values. Therefore, we use SVR to obtain a regression model. In any regression problem, we try to estimate a function that approximates the mapping from an input domain to real numbers on the basis of training samples. SVR is a supervised learning algorithm that can be used to solve regression problems, and it can handle both linear and nonlinear regression problems.
SVR can transform a nonlinear regression problem into a linear one through the implementation of a kernel function, which projects the original feature space into a higher-dimensional space. A hyperplane is then fit in the projected space, and the estimated parameters can be used for subsequent prediction. SVR uses the same principle as SVM. It gives us the flexibility to define how much error is acceptable in our model and finds an appropriate line (or hyperplane in higher dimensions) to fit the data. In simple regression, we try to minimize the error rate, but in the case of SVR the error is required to fit inside a certain threshold bound. SVR tries to approximate the optimized value within a specified margin called the ϵ (epsilon) tube. The objective function and constraints of the SVR model are as follows:

Minimize: (1/2)‖W‖^{2}
Subject to: |y_{i} − w_{i}x_{i}| ≤ ϵ

where W is a vector normal to the hyperplane, y_{i} is the target, w_{i} is the coefficient, and x_{i} is the predictor (feature).
Slack variables are introduced in SVR to relax these stiff conditions. The new objective function and constraints, after the addition of the slack variables ξ_{i} and ξ_{i}^{*}, are as follows:

Minimize: (1/2)‖W‖^{2} + C Σ_{i=1}^{N} (ξ_{i} + ξ_{i}^{*})
Subject to: y_{i} − w_{i}x_{i} ≤ ϵ + ξ_{i};  w_{i}x_{i} − y_{i} ≤ ϵ + ξ_{i}^{*};  ξ_{i}, ξ_{i}^{*} ≥ 0

The constant C is the regularization parameter; the strength of the regularization is inversely proportional to C, and it must be strictly positive.
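The role of the ϵ-tube can be illustrated with scikit-learn's SVR on toy data; the data and parameter values below are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = np.linspace(0.0, 10.0, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(scale=0.05, size=50)   # nearly linear data

# Residuals inside the epsilon-tube incur no penalty, so such points never
# become support vectors: a wider tube leaves fewer support vectors.
tight = SVR(kernel="linear", C=1.0, epsilon=0.01).fit(X, y)
loose = SVR(kernel="linear", C=1.0, epsilon=1.0).fit(X, y)

print(len(tight.support_), len(loose.support_))
```

Widening ϵ from 0.01 to 1.0 swallows most of the noisy points into the tube, so the second model keeps far fewer support vectors.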
MLR is an extension of Linear Regression (LR) that uses multiple independent variables as input and produces the value of one dependent variable as output using the following formula:

y = b_{0} + b_{1}x_{1} + b_{2}x_{2} + ... + b_{n}x_{n}

where y = dependent variable, b_{0} = y-intercept (constant term), x_{1}, x_{2}, ..., x_{n} = independent variables, and b_{1}, b_{2}, ..., b_{n} = regression coefficients/slopes.
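A minimal MLR sketch, assuming a toy data set built from an exact linear relation, shows how the intercept b_0 and the coefficients b_1, ..., b_n are recovered:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data built from the exact relation y = 3 + 2*x1 - 1*x2,
# so the fit should recover the intercept and both coefficients.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

mlr = LinearRegression().fit(X, y)
print(mlr.intercept_, mlr.coef_)   # b0 = 3.0, coefficients [2.0, -1.0]
```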
The proposed method uses an amalgamated model based on Stacked Generalization. The overview of the architecture is shown in
During training, the whole training data set is fed to both base models. The meta-model is trained on the cross-validated predictions of the base models using the k-fold cross-validation technique. The SVR model at level 0 is trained on the full training data set, with the kernel set to Linear, the regularization parameter C set to 1.0 and epsilon (the margin of the decision boundaries from the hyperplane) set to 0.1. During training, the attributes of the model, i.e. the support vectors, the y-intercept and 12 coefficients (one for each feature in the data set), are calculated. The second model at level 0, based on MLR, is also trained on the whole training data set. In this model, 13 parameters need to be tuned: one y-intercept and twelve coefficients for the 12 features present in the data set.
The MLR meta-model is trained on the predictions of the base models generated using the k-fold cross-validation technique. First, the whole training data set is divided into k nearly equal groups/folds. In a single iteration, one group out of these k folds is considered as the validation data set and the remaining k−1 groups are considered as the training data set. The base learners are first trained on the training set (selected from the k folds) and then generate predictions on the validation data set. This process is repeated for the remaining k−1 folds. Each time, the predictions from the base learners are stacked to create an augmented data set. This augmented data set, along with the real target values, is used as the training data set for the meta-learner/final estimator.
During the training on the augmented data set, the meta-model calculates a single y-intercept and two regression coefficients (one for each base model's prediction). The value of k plays a crucial role in generating accurate results, but unfortunately, there is no concrete method for deciding it. Hence, starting with k=2, a trial-and-error method was followed, and it was found that k=25 produces the best results for our model. The procedure of training/testing the models and finding the value of k is detailed in the pseudocode depicted in
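The described configuration and the trial-and-error search for k can be sketched with scikit-learn's `StackingRegressor`; the synthetic data, the train/test split, and the reduced search range (k = 2..10 rather than up to 25) are assumptions made only to keep the example small and runnable.

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the 12-feature stock data set.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 12))
y = X @ rng.normal(size=12) + 0.1 * rng.normal(size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

best_k, best_rmse = None, float("inf")
for k in range(2, 11):   # the study searched further and settled on k = 25
    model = StackingRegressor(
        estimators=[("svr", SVR(kernel="linear", C=1.0, epsilon=0.1)),
                    ("mlr", LinearRegression())],
        final_estimator=LinearRegression(),
        cv=k,            # k-fold CV builds the meta-model's training set
    ).fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    if rmse < best_rmse:
        best_k, best_rmse = k, rmse

print(best_k, round(best_rmse, 4))
```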
The steps followed during the experiments are depicted in
For the sake of simplicity, the plots of real and predicted prices for three companies, one each from the Large, Mid and Small-cap categories, are considered.
The closeness of the real and predicted values in all of the plots signifies that the proposed model is capable of finding the hidden time series patterns in the daily stock price movement data. It can predict the next day's prices with significant accuracy across the Large, Mid and Small-cap categories, yielding practical benefits to traders and investors.






| Company | RMSE (Open) | MAPE (Open) | R² Score (Open) | RMSE (High) | MAPE (High) | R² Score (High) | RMSE (Low) | MAPE (Low) | R² Score (Low) |
|---|---|---|---|---|---|---|---|---|---|
| RIL | 0.04737 | 0.02581 | 0.99208 | 0.08115 | 0.04106 | 0.97802 | 0.08571 | 0.04611 | 0.97251 |
| ASHOKLEY | 0.04065 | 0.01334 | 0.9897 | 0.08043 | 0.0302 | 0.95691 | 0.06485 | 0.02843 | 0.97507 |
| BSOFT | 0.04938 | 0.0088 | 0.99305 | 0.12986 | 0.02773 | 0.95297 | 0.11939 | 0.02836 | 0.95943 |
| SUNPHARMA | 0.05398 | 0.01612 | 0.98911 | 0.09647 | 0.02562 | 0.96535 | 0.08572 | 0.02709 | 0.97253 |
| AUBANK | 0.0559 | 0.02731 | 0.98014 | 0.09455 | 0.05311 | 0.94104 | 0.13632 | 0.06601 | 0.88183 |
| SOBHA | 0.06794 | 0.01933 | 0.9876 | 0.13121 | 0.04236 | 0.95435 | 0.13875 | 0.04654 | 0.9502 |




| Company Name (Henrique et al.) | Country | RMSE | MAPE | Company Name (Proposed Model) | RMSE | MAPE | R² Score |
|---|---|---|---|---|---|---|---|
| Banco do Brasil | Brazil | 0.16427 | 0.20141 | RIL | 0.04737 | 0.02581 | 0.99208 |
| Alpargatas | Brazil | 0.08676 | 0.31555 | ASHOKLEY | 0.04065 | 0.01334 | 0.9897 |
| Metal Leve | Brazil | 0.14270 | 0.10383 | BSOFT | 0.04938 | 0.0088 | 0.99305 |
| Angie's List | USA | 0.10194 | 0.42575 | SUNPHARMA | 0.05398 | 0.01612 | 0.98911 |
| Ping an Insurance | China | 0.31341 | 0.31624 | AUBANK | 0.0559 | 0.02731 | 0.98014 |
| IMAX China Holding | China | 0.19135 | 0.27265 | SOBHA | 0.06794 | 0.01933 | 0.9876 |
Analysis of time series data, such as predicting stock price movement, is a challenging task due to constantly changing market conditions, which depend on multiple parameters and result in very complex patterns. The raw data available at various stock exchanges helps very little in predicting the future behaviour of a stock. However, the proposed model is efficient enough to find and analyze the hidden patterns, producing quite accurate future predictions with low RMSE and MAPE values and high R^{2} scores.
The main influence on the daily stock price movement is events or news related either to the company itself, to the sector to which the company belongs, or to any local or global development that can affect the price movement. Any such after-market news can abruptly change the developing pattern in the stock price movement. If the effect of this news could somehow be quantified and included in the study, the ML models might provide better estimations.
The authors are grateful to the reviewers and facilitators whose constructive comments were useful in improving the content of this document. This work was supported by Bidhan Chandra College, Rishra, and Barrackpore Rastraguru Surendranath College, which provided the platforms and means.