(181b) Self-Consistency Analysis of Physical Property and Molecular Descriptor Databases Using a Variety of Prediction Techniques | AIChE



Shacham, M. - Presenter, Ben Gurion University of the Negev
Paster, I., Ben Gurion University of the Negev
Brauner, N., Tel-Aviv University

      Pure-compound property data are widely
used in process design, simulation and optimization, environmental impact
assessment, hazard and operability analysis and other diverse areas of
chemistry and chemical engineering. Presently, pure-compound property databases
(such as the DIPPR database, Rowley et al., 2010) serve as the main
sources of property data. To provide maximal benefit to the users, these
databases typically contain both experimental and predicted data. Both types of
data are associated with certain levels of uncertainty. For experimental data
the challenge is the selection of the "best" value from several
reported values, while for predicted values only an average "prediction
error" is reported, which is based on a particular training set of
compounds and may not be applicable to the particular compound considered. New,
more accurate prediction techniques are continuously being developed, which
enable replacing older, less accurate predicted values with new, more accurate
ones.

      However, periodic screening of
the database to identify data points that need to be replaced represents a
great challenge because of the huge amount of data involved. To make this task
easier, a system is required that can screen all the data in
the database and flag potentially erroneous or low-precision data.
Once such a point has been identified, various
state-of-the-art property prediction techniques, such as the multi-level group
contribution method of Marrero and Gani (2001), asymptotic behavior correlations
(ABCs, Marano and Holder, 1997), Targeted Quantitative Structure Property
Relationships (TQSPRs, Brauner et al., 2006) and the Reference Series method
(Shacham et al., 2012), can be used to determine whether the particular value
needs to be, and can be, replaced by a more accurate one.

      The QSPR-type property prediction
methods require molecular descriptors to represent the structure of the
molecule. In recent years, computer programs that can calculate several
thousand descriptors have emerged. Checking the accuracy and consistency of
the molecular descriptors, and the correctness
of the associated molecular structure files (which are often provided in MOL format),
represents an additional major challenge. Flagging potentially erroneous property
data points can also help in identifying incorrect MOL files.
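
      As an illustration of the kind of structural check a MOL file can be given,
the sketch below parses a V2000 MOL block and reports basic inconsistencies. It
is not the authors' procedure: the function name `check_molfile`, the
`expected_carbons` argument and the choice of checks are assumptions made here,
and the parser assumes well-formed fixed-width fields.

```python
def check_molfile(text, expected_carbons=None):
    """Return a list of problems found in a V2000 MOL block (empty if none).

    Hypothetical helper for illustration; it assumes well-formed numeric fields.
    """
    lines = text.splitlines()
    # Lines 1-3 are the header; line 4 is the counts line ending in "V2000".
    if len(lines) < 4 or "V2000" not in lines[3]:
        return ["missing or non-V2000 counts line"]

    # The counts line stores the atom and bond counts in two 3-character fields.
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    atom_lines = lines[4:4 + n_atoms]
    bond_lines = lines[4 + n_atoms:4 + n_atoms + n_bonds]
    if len(atom_lines) < n_atoms or len(bond_lines) < n_bonds:
        return ["file is shorter than the counts line implies"]

    problems = []
    # Atom lines hold x, y, z and then the element symbol.
    symbols = [ln.split()[3] for ln in atom_lines]

    # Bond lines hold two 1-based atom indices in 3-character fields.
    for ln in bond_lines:
        a, b = int(ln[0:3]), int(ln[3:6])
        if not (1 <= a <= n_atoms and 1 <= b <= n_atoms) or a == b:
            problems.append(f"bond refers to invalid atoms: {a}-{b}")

    # Optional cross-check of the carbon count against an expected nC.
    if expected_carbons is not None and symbols.count("C") != expected_carbons:
        problems.append(
            f"expected {expected_carbons} C atoms, found {symbols.count('C')}"
        )
    return problems
```

A check of this kind only catches gross structural errors (a wrong atom count or
a dangling bond). A MOL file that is well formed but describes the wrong isomer
would pass it, which is why flagged property predictions remain a useful
complementary signal.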

      In the system we have developed, the single-descriptor version of the TQSPR method
(Shacham et al., 2007) is used for the initial screening of the database: every
compound in the database is selected in turn as the target compound, and
all its available constant properties are predicted. If the difference between the
recommended database value and the predicted value is considerably higher than
the uncertainty assigned to the property in the database, the data point
is flagged as potentially erroneous. The flagged data points require additional
analysis, as the large differences can have various sources. One
potential source is the unsuitability of the prediction technique used for the
screening. The single-descriptor TQSPR method (like most, if not all, other prediction
techniques) may not provide accurate predictions for compounds with low carbon
numbers (nC), such as the first members of homologous series,
whose properties are known to change irregularly, or for solid properties at nC
≤ 20, where odd and even nC compounds follow different
trends. Another potential
source is an incorrect MOL file or some
erroneous molecular descriptors for the target compound. In many cases, however,
the large differences between the predicted and the database-recommended
values are caused by improper selection of the "recommended" value
from the available experimental data, or by low accuracy or inconsistency of the
available data.
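
      The screening loop described above can be sketched as follows. This is a
minimal illustration under stated assumptions, not the authors' implementation:
the similarity measure (Euclidean distance in descriptor space), the training
set size `n_similar`, the threshold `factor`, the use of absolute uncertainties
and the function names are all choices made here for the example. The published
TQSPR method selects similar compounds and descriptors by its own criteria.

```python
import numpy as np


def predict_single_descriptor(X_train, y_train, x_target):
    """Fit y = a + b*d on the descriptor d most correlated with y.

    Returns the predicted value for the target and the descriptor index used.
    """
    Xc = X_train - X_train.mean(axis=0)
    yc = y_train - y_train.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant columns give a zero denominator; treat them as uncorrelated.
    with np.errstate(divide="ignore", invalid="ignore"):
        corr = np.where(denom > 0, (Xc * yc[:, None]).sum(axis=0) / denom, 0.0)
    j = int(np.argmax(np.abs(corr)))
    slope, intercept = np.polyfit(X_train[:, j], y_train, 1)
    return intercept + slope * x_target[j], j


def screen(descriptors, values, uncertainties, n_similar=10, factor=2.0):
    """Select each compound in turn as the target and flag suspicious values.

    descriptors: (n_compounds, n_descriptors) array
    values, uncertainties: (n_compounds,) arrays in the same (absolute) units
    Returns the indices of compounds whose database value differs from the
    prediction by more than factor * uncertainty (assumed threshold).
    """
    flagged = []
    n = len(values)
    for t in range(n):
        others = np.array([i for i in range(n) if i != t])
        # Training set: the compounds closest to the target (assumption).
        dist = np.linalg.norm(descriptors[others] - descriptors[t], axis=1)
        train = others[np.argsort(dist)[:n_similar]]
        pred, _ = predict_single_descriptor(
            descriptors[train], values[train], descriptors[t]
        )
        if abs(pred - values[t]) > factor * uncertainties[t]:
            flagged.append(t)
    return flagged
```

Note that an erroneous value also enters the training sets of its neighbours, so
compounds adjacent to a bad data point may be flagged as well. This is one
reason why flagged points call for the additional analysis described above
rather than automatic replacement.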

      We evaluated the proposed
technique by applying it to a database that contains constant physical property
data for 1798 compounds. The database includes numerical values and
data uncertainties for 32 properties (critical properties, normal melting and
boiling temperatures, heat of formation, flammability limits, etc.). All the
property data are from the DIPPR database (Rowley et al., 2010). The
database also contains 3224 molecular descriptors generated by the Dragon software, version
5.5 (DRAGON is copyrighted by TALETE srl, http://www.talete.mi.it),
from minimized 3-D molecular models. The molecular structure (MOL) files were
provided by Rowley (2010).

      The results of this evaluation
will be presented in the extended abstract and the presentation. Some typical
cases where the property or the molecular structure databases needed updating
will be discussed in more detail. These examples include cases where the predicted
values for long-chain substances exceeded the accepted maximal
("infinite") values of some properties, cases where the recommended
property values for a homologous series were inconsistent with the accepted
values of another series, and cases where
incorrectness of the MOL files used prevented obtaining satisfactory TQSPR


Brauner, N.; Stateva, R. P.; Cholakov, G. St.; Shacham, M. Structurally "Targeted"
Quantitative Structure-Property Relationship Method for Property Prediction. Ind. Eng. Chem. Res. 2006, 45, 8430-8437.

Marano, J.J.; Holder, G.D. General Equations for Correlating the Thermo-physical
Properties of n-Paraffins, n-Olefins and other Homologous Series. 2. Asymptotic
Behavior Correlations for PVT Properties. Ind. Eng. Chem. Res. 1997A,
36, 1895.

Marrero, J.; Gani, R. Group-contribution based estimation of pure component properties. Fluid
Phase Equilibria
2001, 183.

Rowley, R.L.; Wilding, W.V.; Oscarson, J.L.; Yang, Y.; Zundel, N.A. DIPPR Data
Compilation of Pure Chemical Properties, Design Institute for Physical
Properties (http://www.aiche.org/dippr), Brigham Young University, Provo, Utah,
2010.

Rowley, R. L. Personal communications, 2010.

Shacham, M.; Kahrs, O.; Cholakov, G. St.; Stateva, R.; Marquardt, W.; Brauner, N. The
Role of the Dominant Descriptor in Targeted Quantitative Structure Property
Relationships, Chem. Eng. Sci. 2007, 62, (22), 6222-6233.

Shacham, M.; Paster, I.; Brauner, N. Property Prediction and Consistency Analysis
by a Reference Series Method, AIChE J., Accepted for Publication (2012).