Article, Volume 15, No. 1, 2021

Price Index through Web Scraping


Paulina Pegueroles Encina, Rubén Guerrero Vera, Amaru Fernández Durán, Diana López




Purpose – This study presents the design of a Real Estate Price Index for Región Metropolitana for the period from January 2017 to 14 August 2019. The Index measures the dynamism of the sector and was built by processing 750,000 observations obtained from the website TocToc. This site is a secondary source of information providing deterministic variables to build a formal, reliable and representative database.

Design/methodology/approach – To generate the Real Estate Price Index for Región Metropolitana, we adopted a methodology based on a synthetic index, specifically the Hedonic Price Model (HPM), in which the price is explained in terms of the (heterogeneous) characteristics of each dwelling.

The challenge will be to add more observations to the sample collected from the TocToc website and from other related sites.

Findings – The Laspeyres methodology showed the highest, though marginal, volatility compared to the other techniques. In addition, the results showed a declining trend in final housing prices in the second quarter of 2019, due to both the low number of transactions and the stagnation of the construction sector, similar to the results provided by the Central Bank of Chile.

Originality/value: The use of web scraping to obtain data in real time, which makes it possible to analyse real estate prices as they currently stand, is an advantage, because other indexes have a delay of six months or more.






This study presents the design of a Real Estate Price Index for Región Metropolitana for the period from January 2017 to 14 August 2019. The Index measures the dynamism of the sector and was built by processing 750,000 observations obtained from the website TocToc. This site is a secondary source of information providing deterministic variables to build a formal, reliable and representative database.

The real estate market is key for the growth of some countries [1] due to its role in gross fixed capital formation, consumption, and the financial system. Any variation in housing prices affects a nation's aggregate demand and its financial stability, as Idrovo & Lennon state:

In Chile and in most developed countries, real estate investment plays a fundamental role in aggregate activity. So much so that, in the National Accounts of the Central Bank, housing is considered a component of Gross Fixed Capital Formation, equivalent to 29% of the current value of investment in construction (that is, one third of its real value) and to 2.4% of aggregate GDP (2011, p. 3).

The Central Bank of Chile (2014) has developed a real estate price index, but its estimates are biased because of the diversity of properties, which creates the need for a set of uniform objects to remove or minimize this bias. Dwellings, however, are characterized by the diversity of their attributes, which makes them unique and non-comparable goods. This is the main difficulty in constructing a real estate price index.

The development of an index depends on applying different methodologies to decrease the bias, as Idrovo & Lennon (2011) state:

These biases could only be eliminated if the market prices of exactly the same properties were compared in each period, which is empirically impossible; hence the need to use econometric methods that make it possible to construct in theory what cannot be sustained in practice (2011, p. 4).

The problem, therefore, is how to analyse housing price variations while considering the dwellings' attributes and minimizing the bias in the data, as found in previous research carried out by Banco Central de Chile (BCCh) and Cámara Chilena de la Construcción (CChC).



The main objective of the research is to prove that a secondary source of information can represent the situation in the sector in a timely and effective manner [2], similarly to a primary one. It also aims at showing that features such as price, square metres and the location of the dwelling are not the main deterministic variables for designing a real estate housing index.

The information provided by this real estate index could potentially be used by any agent in the economy, whether for decision-making in the sector, by investors, or by individuals buying property. For the time being, the data will be available for generating timely statistics on housing price dynamism, so that the strength of the index can be confirmed over time.



The housing market, real estate and the construction sector together form a single market. However, there are some subtle differences among them. The construction industry produces dwellings of varying nature with different degrees of intervention. Housing, in turn, is considered a free market, self-regulated by the suppliers and demanders in the sector. Public works, by contrast, belong to the non-residential segment and are entirely subject to intervention.

Along these lines, Obaíd (2003) presents typical characteristics of buildings, both new and old, in the real estate sector:

[…] one of them, and an evident one, is the case of transactions: unlike in other markets, transactions are far fewer; in addition, assets can only be valued by reference, and the market is characterized by a low capacity to generate short-term liquidity, since these are fixed assets whose realization is slow and difficult. This is due to all the legal formalities that must be followed and to the high prices of the operations, even the smallest ones, when compared with other markets (Salazar & Díaz, 2014, p. 17).

In Chile, as in other countries, dwellings are an essential component of family wealth and the main loan guarantee in the financial system. For this reason, the variation of housing prices is directly related to household consumption, affecting the financial situation of the country and of the financing entities (Parrado, Cox & Fuenzalida, 2009).

As a result, the stability of the financial system is fundamental, because it provides a high capacity to grant loans to real estate companies and partnerships. This was evident in 2007, when the construction sector accounted for 9% of the loans placed by the banking system as a whole.

The real estate market is, therefore, relevant for the economic stability of the country, both in its role in household balance sheets and in the financial system. It has a positive impact by generating employment in the construction sector, which was close to 10% of the total workforce in 2017, and it contributes to nominal GDP, reaching 14% in that period (Ortúzar, 2018).



To generate the real estate price index for Región Metropolitana, we adopted a methodology based on a synthetic index, specifically the Hedonic Price Model (HPM), rather than the repeat-sales method, because the latter can only be applied to price changes in dwellings sold more than once (Banco Central de Chile, 2014, p. 22). In addition, the repeat-sales method involves only old homes (Idrovo & Lennon, 2011), unlike the HPM, where the price is explained in terms of characteristics (heterogeneous for each dwelling); i.e., the effect of every characteristic on the dwelling price is estimated by means of multivariate analysis (GLS [3]). Additionally, the HPM does not distinguish between new and old homes.

The HPM can present disadvantages, since it could generate biased estimates for two reasons: first, it may omit relevant variables that directly affect the dwelling price; second, the relation between specific characteristics of dwellings and their effects on the final price may be inadequately specified. Finally, it assumes that the peculiarities of a dwelling do not vary over time, which is not the case.

The analysis is carried out through three models taking into account the following variables in their generation:

  • UF (Unidad de Fomento) value
  • Useful square metres (m2)
  • Bathrooms
  • Bedrooms
  • Operation type (house/apartment; new/old)
  • District groups in Región Metropolitana (RM)
  • Advertisement date (month and year)

Additionally, the following interaction variables were considered:

1. Area interaction variable: the product of the district group and the useful square metres:

Area_n = group_n × m2u

where Area_n is the product of the district group of Región Metropolitana (group_n) and the useful square metres (m2u).

2. Type-of-dwelling interaction variable: the product of the house/apartment dummy variable and the useful square metres:

C = house/apartment × m2

where C is the product of the house/apartment dummy variable (house/apartment) and the useful square metres (m2).
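As an illustration of how the two interaction variables could be built for a single listing, the following is a minimal sketch. It assumes that the district group enters as one dummy per group (seven groups), so that each Area_n column equals the useful square metres for listings in group n and zero otherwise; the `Listing` class and its field names are hypothetical and are not taken from the original database.

```python
from dataclasses import dataclass

N_GROUPS = 7  # the text classifies the districts into seven groups


@dataclass
class Listing:
    """One advertised dwelling (hypothetical field names)."""
    m2_useful: float     # useful square metres
    district_group: int  # 1..7, district group in Región Metropolitana
    is_house: int        # dummy: 1 = house, 0 = apartment


def interaction_terms(listing):
    """Return the interaction regressors for one listing as a dict."""
    terms = {}
    # Area_n: district-group dummy times useful square metres (assumed coding).
    for n in range(1, N_GROUPS + 1):
        group_dummy = 1 if listing.district_group == n else 0
        terms[f"area_{n}"] = group_dummy * listing.m2_useful
    # C: house/apartment dummy times useful square metres.
    terms["c"] = listing.is_house * listing.m2_useful
    return terms


example = Listing(m2_useful=80.0, district_group=3, is_house=1)
terms = interaction_terms(example)
print(terms)
```

With this coding, each district group and each dwelling type can have its own price effect per square metre in the regression.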

The models are expressed as follows:

Equation 1: Multivariate Regression: (r1):

Where the dwelling price (P_i) is explained by means of its different characteristics.

Equation 2: Multivariate Regression: (r2):

The second model or semi-log (r2) transforms the dependent variable into a logarithm.

Equation 3: Multivariate Regression: (r3):

The third and last model, log-log (r3), expresses both the price and the useful square metres in logarithms.

The Akaike Information Criterion (AIC) was used to select one of the three models. It showed that the log-log model (r3) explains the data with the minimum number of parameters and, in comparison to the linear model, takes into account the non-linearity between the square metres and the dwelling price, as shown in Table 1.

The coefficients and standard deviations for every explanatory variable of the model can be observed in Table 2. Tables 3 and 4 show the individual coefficients for the apartment index and the house index.

After the selection of the model, the price index was formulated based on three kinds of methodology: Laspeyres, Paasche and Fisher.
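The AIC comparison can be sketched as follows. This is a simplified illustration, not the estimation used in the study: it fits a one-regressor ordinary least squares model instead of the multivariate GLS model, the listings are invented values, and it compares only the semi-log and log-log forms, which share the same dependent variable (comparing them with the model in levels would additionally require a Jacobian adjustment of the likelihood).

```python
import math


def ols_simple(x, y):
    """Fit y = a + b*x by ordinary least squares; return (a, b, rss)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, rss


def aic_gaussian(rss, n, k):
    """AIC of a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss / n) + 2 * k


# Hypothetical listings: useful m2 and price in UF (illustrative values only).
m2 = [30, 45, 60, 80, 100, 140, 200, 300]
noise = [1.02, 0.97, 1.01, 0.99, 1.03, 0.98, 1.00, 1.01]
price_uf = [50 * s ** 0.9 * e for s, e in zip(m2, noise)]

log_price = [math.log(p) for p in price_uf]
log_m2 = [math.log(s) for s in m2]
n = len(m2)

# Semi-log: ln(P) on m2.  Log-log: ln(P) on ln(m2).
_, _, rss_semilog = ols_simple(m2, log_price)
_, slope_loglog, rss_loglog = ols_simple(log_m2, log_price)

aic_semilog = aic_gaussian(rss_semilog, n, k=2)
aic_loglog = aic_gaussian(rss_loglog, n, k=2)
best = "log-log" if aic_loglog < aic_semilog else "semi-log"
print(best, round(slope_loglog, 3))
```

The model with the lower AIC is preferred; in the log-log form the slope can be read as the elasticity of the price with respect to the useful square metres.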

  • The Laspeyres Index [4] is an arithmetic mean of simple price indexes that uses, in every period t, the value of the transactions carried out in a base period as its weights (Curiel Díaz, 1997).
  • The Paasche Index [5] takes into account the purchase patterns of regular buyers, reflecting the tastes and needs of consumers within the index (Webster, 2000).
  • The Fisher Index is the ideal price index because it reduces the shortcomings of the other two indexes by taking the square root of their product (Webster, 2000).
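The three index types listed above can be illustrated with their textbook price-and-quantity formulas. The sketch below uses two invented goods; the prices and quantities are assumptions for illustration only and do not come from the study's data.

```python
import math


def laspeyres(p0, pt, q0):
    """Current prices over base prices, both weighted by base-period quantities."""
    return sum(p * q for p, q in zip(pt, q0)) / sum(p * q for p, q in zip(p0, q0))


def paasche(p0, pt, qt):
    """Current prices over base prices, both weighted by current-period quantities."""
    return sum(p * q for p, q in zip(pt, qt)) / sum(p * q for p, q in zip(p0, qt))


def fisher(p0, pt, q0, qt):
    """Geometric mean of the Laspeyres and Paasche indexes."""
    return math.sqrt(laspeyres(p0, pt, q0) * paasche(p0, pt, qt))


# Hypothetical base-period (0) and current-period (t) prices and quantities.
p0, q0 = [100, 200], [10, 5]
pt, qt = [110, 210], [8, 6]

index_l = laspeyres(p0, pt, q0)      # 2150 / 2000
index_p = paasche(p0, pt, qt)        # 2140 / 2000
index_f = fisher(p0, pt, q0, qt)
print(index_l, index_p, index_f)
```

In this example the Fisher value lies between the Laspeyres and Paasche values, which is the property the text relies on when it describes the Fisher index as an intermediate result.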

In addition, the models have in common the same parameters ϑ and θ corresponding to the slope and intercept, respectively.

Equation 4: Price Index

Source: Cámara Chilena de la Construcción, 2011.



The observations used for the Real Estate Index of the RM were reduced from a total of 750,000 to 85,798, because the following were removed: all dwellings for rent; all dwellings outside RM; observations with no data on region, district, useful square metres, bedrooms, bathrooms or UF value; and observations that generated outliers. Of the total data, 16% were imputed, because they lacked one characteristic but had data for the other variables. Table 5 shows the percentage of imputed values for each variable.

The UF value is the dependent variable of the study. It was adjusted according to the value stated by Banco Central de Chile at the time the dwelling was advertised. Then, missing values and values under 150 UF or over 55,000 UF were imputed [6]. Another variable for which minimum and maximum bounds were established is the useful square metres, with a minimum of 21 m2 and a maximum of 10,000 m2. These bounds were chosen because some listings reported implausible square metres or locations, owing either to typing errors or to misreported dwelling characteristics.
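A minimal sketch of this cleaning step is given below. The bounds are the ones stated in the text; the choice of the district group as the conditioning variable for the conditional mean, the dictionary keys and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict

UF_MIN, UF_MAX = 150, 55_000   # bounds on the UF value stated in the text
M2_MIN, M2_MAX = 21, 10_000    # bounds on the useful square metres


def in_bounds(value, low, high):
    """True when the value is present and inside [low, high]."""
    return value is not None and low <= value <= high


def impute_conditional_mean(rows, target, by, low, high):
    """Replace missing or out-of-range `target` values with the mean of the
    valid values that share the same `by` key (conditional mean)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if in_bounds(row[target], low, high):
            sums[row[by]] += row[target]
            counts[row[by]] += 1
    cleaned = []
    for row in rows:
        new_row = dict(row)
        if not in_bounds(row[target], low, high) and counts[row[by]] > 0:
            new_row[target] = sums[row[by]] / counts[row[by]]
        cleaned.append(new_row)
    return cleaned


# Hypothetical listings: district group and price in UF.
rows = [
    {"group": 1, "uf": 5000.0},
    {"group": 1, "uf": 7000.0},
    {"group": 1, "uf": None},      # missing value
    {"group": 2, "uf": 3000.0},
    {"group": 2, "uf": 90000.0},   # above the 55,000 UF bound
]
cleaned = impute_conditional_mean(rows, "uf", "group", UF_MIN, UF_MAX)
print([r["uf"] for r in cleaned])
```

The same function could be applied to the useful square metres by passing `M2_MIN` and `M2_MAX` as the bounds.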

After the imputation, out of the 85,798 observations, 49% are houses and 51% are apartments; 1% correspond to new housing and 99% to old housing. This shows that they are mostly old apartments.

To distinguish between new and old real estate, we created a dummy variable where 1 is new and 0 is old. The same was done for the type of dwelling, with houses being 1 and apartments being 0.

Regarding the number of observations, 2019 is the year with the highest number of advertised properties, with a total of 43,759, which represents a 25% increase compared to 2018. In 2018 the number of advertisements increased fourfold compared with the previous year, probably due to e-commerce trends in Chile. According to Centro de Economía Digital CCS (2019), financial and real estate services grew by 146% in the digital market in 2018. The database comprises 57 districts, which are classified into seven groups according to their frequency and their geographical location. Table 7 shows these frequencies.

From the above table, we can see that group 1 has the highest number of advertisements of real estate for sale, with 30,839. Las Condes is the district with the highest number of observations, with a total of 13,576, followed by Santiago, in group 4, with 9,206 observations.

With the district groups it is possible to observe the relation between square metres and the UF value (see the scatter plot in Graph 1).

This graph shows that the data of group 1 lean towards the left, indicating an important number of small dwellings at a high price. These high prices are due to their location, which increases their capital appreciation (not analysed in this article). In the first quarter of 2019, the sale prices of one-bedroom, one-bathroom apartments increased by 7.8% in comparison to the first quarter of 2018. Three-bedroom, one-bathroom houses, in turn, increased their prices by 10.3%, whereas three-bedroom, two-bathroom houses increased by 5.5% [7].

The dwelling price in UF varies mainly according to its square metres, rather than the number of rooms. The most frequently advertised houses have 3 or 4 bedrooms (70%) and the most frequently advertised apartments have 2 or 3 bedrooms (35% and 43%, respectively).

Table 8 shows the descriptive statistics for every explanatory variable in the period January 2017 – August 2019.

Two interaction variables are added. First, the area interaction variable (the product of the district group and m2). Second, the type-of-dwelling interaction variable (the product of the house/apartment dummy variable and m2).

To create the index, we generated a linear model expressed in logarithms, which takes into account the non-linearity between the square metres and the price. After calculating the monthly prices, we used the estimates for December 2018 as the base period, as required by the Laspeyres Index methodology. The index was calculated both for houses and apartments together and for houses and apartments separately, so as to reach a more accurate result for each of them. Graph 2 shows the volatile performance of prices in the overall index, which includes houses and apartments. The same occurs in relation to the square metres, which, according to the primary source, Banco Central de Chile, closed on a decline by July 2019.

By the second quarter of 2019 we observe that this downward trend is related only to the performance of apartment prices (see Graph 4) rather than to that of houses (see Graph 5). This shows that apartments are the most important type of dwelling for sale in the real estate market.

The variation in dwelling prices under the Paasche index is similar to that under the Laspeyres index, for both houses and apartments. However, the structures of the two indexes are different. The Laspeyres methodology uses the base period, December 2018, and the square metre average is held constant for the estimations. In the Paasche methodology, the regression coefficient of the base period is kept, but the square metres of each period are used.

We highlight that, since May 2018, apartment prices (Graph 7) have shown a slower-paced performance than in the previous index.

Because the Fisher index corresponds to the square root of the product of the previous two indicators, it aims at diminishing the undervaluation and the overvaluation generated by the Laspeyres and Paasche indexes in the different periods. Therefore, it is an intermediate result between the two. However, as the fluctuations are similar, the Fisher index shows no significant differences, confirming the results previously mentioned.
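The relation between the three indexes can be sketched with a common characteristics-based hedonic formulation, in which a log-log regression is estimated for each period and the predicted price of an average dwelling is compared with the base period. This is an assumed, simplified formulation with a single regressor and invented coefficients; it may differ in detail from the price index equation used in the study.

```python
import math

BASE = "2018-12"  # base period stated in the text (December 2018)

# Hypothetical per-period coefficients of ln(P) = intercept + slope * ln(m2).
coefs = {
    "2018-12": (4.00, 0.90),
    "2019-01": (4.02, 0.90),
    "2019-02": (4.01, 0.91),
}
# Hypothetical mean ln(m2) of the dwellings advertised in each period.
mean_log_m2 = {"2018-12": 4.40, "2019-01": 4.45, "2019-02": 4.38}


def predicted_log_price(period, log_m2):
    """Predicted ln(price) for a dwelling of size exp(log_m2) in `period`."""
    intercept, slope = coefs[period]
    return intercept + slope * log_m2


def hedonic_laspeyres(t):
    """Characteristics held at their base-period average (base = 100)."""
    x0 = mean_log_m2[BASE]
    return 100 * math.exp(predicted_log_price(t, x0) - predicted_log_price(BASE, x0))


def hedonic_paasche(t):
    """Characteristics taken at their period-t average (base = 100)."""
    xt = mean_log_m2[t]
    return 100 * math.exp(predicted_log_price(t, xt) - predicted_log_price(BASE, xt))


def hedonic_fisher(t):
    """Square root of the product of the two indexes above."""
    return math.sqrt(hedonic_laspeyres(t) * hedonic_paasche(t))


for period in coefs:
    print(period, hedonic_laspeyres(period), hedonic_paasche(period),
          hedonic_fisher(period))
```

By construction, all three indexes equal 100 in the base period, and the Fisher value always lies between the other two.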



The database collected by web scraping meets the quality requirements stated by Wang & Strong (2013), e.g., credibility, objectivity and relevance. This shows that it is possible to build a tertiary source of information, or non-traditional mechanism for collecting information, by contrasting sources. This article compares its results with the studies carried out by Banco Central de Chile and Cámara Chilena de la Construcción, portraying the reality of the real estate sector and explaining its dynamism through ongoing and prompt data.

Among the three methodologies, Laspeyres showed the highest, though marginal, volatility. In addition, it showed a downward trend in final housing prices in the second quarter of 2019, due to both the low number of transactions and the stagnation of the construction sector, similar to the results provided by Banco Central de Chile.

For better results, the following changes could be made in the study:

  • Add data to the TocToc sample from other similar sources, for example, Portal Inmobiliario, el Rastro and Yapo. This would yield an index with a wider time span, in which slower-paced variations in prices could be observed more accurately. Additionally, a quarterly index could be built to compare the results with the indexes from primary sources for the same period.
  • Extend the index to other regions of Chile. Also, add other variables from the TocToc website to the hedonic model, for instance, the number of favourites, visits and interested people, the appraised value, and payment delinquency.
  • Expand the analysis using rental data to generate an index that shows which characteristics increase or decrease the rental price of dwellings. This is based on the fact that the Ministerio de Desarrollo Social (2017) states that the number of people renting housing is on the rise.





Banco Central de Chile (2014). Índice de Precios de Vivienda en Chile: Metodología y Resultados. Retrieved from:

Curiel Díaz J. (1997). La teoría de los índices de precios. Cuadernos de Estudios Empresariales, (7), 71-88. Madrid, España. Escuela Universitaria de Estudios Empresariales. Universidad Complutense. ISSN: 1131-6985

Centro Economía Digital CCS (2019). Tendencia del Comercio electrónico en Chile. Retrieved from:

Idrovo, B. & Lennon, J. (2011). Hedonic pricing models to calculate price indexes for new houses in the Santiago province. Munich Personal RePEc Archive. Retrieved from:

Ministerio de Desarrollo Social (2017). Resultados Vivienda y entorno. Retrieved from:

Ortúzar, R. (2018). Inversiones + Planificación = Desarrollo Sustentable. [PowerPoint slides, 1-12, printed]

Parrado, E., Cox, P. & Fuenzalida, M. (2009). Evolución de los Precios de Viviendas en Chile. ResearchGate, 12(1). Retrieved from:

Salazar, M. & Díaz, M. (2014). Influencia del Desarrollo Urbano en el Mercado Inmobiliario de la ciudad de Santiago. Bachelor thesis in Business Administration, Universidad del Bio-Bio, Chillán, Chile. Retrieved from:

Webster, A. (2000). Estadística aplicada a los negocios y la economía (3rd ed.). Bogotá, Colombia: McGraw-Hill.








  1. An example is the 2007-2009 subprime crisis in the US, where banks offered mortgages aiming at the geographic diversification of their investments, thereby spreading the risk of their portfolios more widely. This generated a real estate bubble that directly affected the sector and other agents of the economy.
  2. It is a source of information that can explain the dynamism of the real estate sector with real, timely and current data.
  3. Generalized Least Squares.
  4. The advantage of this index is that it requires quantity data for only one period, because the only element that varies over time is price. Its disadvantage is that it overestimates price increases, since it keeps quantities constant.
  5. The weakness of this index is that it underestimates price increases.
  6. The imputation method applied to all the variables of the study is conditional mean imputation for a set of data.
  7. According to the Informe Trimestral de Viviendas of Portal Inmobiliario.