Analysis of energy and major components in chromatographic signals for the diagnosis of prostate cancer

Prostate examination is an early detection tool for prostate cancer, but the main diagnostic tools for obtaining evidence are generally invasive. This article examines chromatographic signals from the urine of prostate cancer patients and control patients as a proposal for a non-invasive examination. Methodologically, urine samples are taken, digitized as chromatograms, conditioned with mathematical techniques and classified. The conditioning techniques are time normalization, dead time elimination, baseline correction, noise elimination, and peak alignment. The classification techniques analyze the energy, in the time and frequency domains, and the principal components in scree (sedimentation) plots and score plots. As a result, the chromatographic signal is characterized and the characteristic curve that represents the signals of prostate cancer patients and control patients is identified. The data structure shows a cluster containing 88.88% of the vectors of the control patients. For prostate cancer patients, the data are distributed in clusters around the area defined by the control patients. This characterization demarcates signal classification regions for diagnosing possible prostate cancer patients, validating the relationship between the chromatographic signal and cancer.


Introduction
Prostate cancer is one of the cancers that most affects men today; more than 5% of every million people are affected by this disease. In addition, the early detection tools available to prevent it and the main diagnostic instruments for obtaining evidence are generally invasive, the best known being the digital rectal examination and the serum concentration of prostate-specific antigen. In this sense, [1] identified factors that may be related to men not undergoing the exam, such as fear of cancer, shame, discomfort, pain, low educational level, misinformation about the exam, distrust of medical professionals and concern that the rectal examination may affect masculinity. This research is expected to mitigate these factors by taking advantage of the increasing use of new technologies, with which applications have been developed to improve health conditions worldwide [2], seeking to make the procedures as effective and as minimally invasive as possible.
Computer-assisted diagnostic systems that use signal processing techniques have been widely used to diagnose diseases such as upper limb sarcopenia [3], cardiovascular diseases [4], [5] and Parkinson's disease [6] - [8], to mention a few. Likewise, attempts have been made to diagnose prostate cancer using image processing techniques applied to chemically treated biopsies [9], [10], and others have used machine learning techniques to improve the validity of the diagnosis [11], [12]. However, these methods remain invasive in the way the sample is obtained for the analysis of the information it contains.
It is nevertheless possible to obtain information on prostate cancer non-invasively through chromatography, a method by which the chemical components of a sample are separated. The result is represented by a one-dimensional signal with which delay times, energy or concentration can be analyzed, allowing the qualitative and quantitative identification of chemical components based on their distribution for characterization [13].
Accordingly, this article takes urine samples through a chromatographic process to obtain one-dimensional signals, analyzes their differentiating characteristics by applying signal processing techniques and identifies whether each signal corresponds to a patient with prostate cancer or to a control patient (no prostate cancer).
Processing techniques for the characterization of chromatographic signals include time normalization, dead time elimination, baseline correction, noise elimination, signal alignment, energy analysis for feature extraction and principal component analysis for classification.
This document presents the materials and methods, describing the methodology implemented, and then presents the results obtained with their respective analyses. Figure 1 shows the research methodology, organized in the stages of sampling, database consolidation, signal conditioning using mathematical processing techniques and chromatogram classification. Sampling includes the search for patients diagnosed with prostate cancer and control patients and the chemical preparation of the sample. The purpose of this stage is to have an appropriate urine sample from each patient, ready to enter the chromatograph, a process that occupies 61% of the total time dedicated to the procedure. Adequate sample preparation is therefore a determining factor in the correct performance of a chromatography, since it guarantees the integrity of the results and eliminates the contaminants that affect the chromatogram [14].

Materials and methods
The consolidation of the database comprises the digitization of the one-dimensional signals into text files, one per patient, whose data correspond to the intensities of the sample in millivolts. Each patient has an associated chromatogram, which is constructed iteratively until the appropriate resolution is obtained [13], [15]. The attributes of the chromatogram are: the peak height ($h_p$), which describes the portion of a component defined by its height and width at the baseline; the dead time ($t_m$), which is the time taken to register the first component; the retention time ($t_r$), the time elapsed between the start of the sample and the appearance of the peaks, used to identify and characterize signals; the peak-valley ratio ($h_p/h_v$), used as a classification criterion when two consecutive peaks cannot be separated at the baseline, where $h_v$ is the height, above the extrapolated baseline, of the lowest point of the curve that separates the minor and major peaks; and, finally, the peak width at half height ($W_{h/2}$) and the peak area ($A$), parameters used to describe signal characteristics in a given context [15].
Signal conditioning adapts the data, through mathematical processing techniques, so that the characteristics of the signal can be correctly classified and validated. To do this, it uses the following processes: time normalization, dead time elimination, baseline correction, elimination of high- and low-frequency noise, and peak alignment [16] - [18].
The chromatogram classification extracts the energy of each peak, in the time and frequency domain, to form characteristic vectors for each signal, which are used to classify the signals with principal component analysis. This last analysis is commonly used to validate if the chosen characteristics of a signal, in this case, the energy, are correct for classification [19] - [22].

Results and Discussion
This section follows the methodology of Figure 1: it presents the results of processing the chromatographic signals from urine samples and discusses them.

Sampling
Chromatographic signals were acquired at the Quality Control Laboratory of the University of Pamplona from the analysis of urine samples. Eighteen samples were taken, nine from prostate cancer patients and nine from control patients.

Database Consolidation
Eighteen text files in .txt format were obtained, one for each patient, which constitute the database of chromatographic signals from urine samples. Figure 3 shows the chromatograms of a patient with prostate cancer and of a control patient. The control patient records one prominent peak in the chromatogram, while the patient with prostate cancer differs by presenting two prominent peaks. That is, the chromatographic signal contains an indication that may be related to prostate cancer.

Signal Conditioning
Figure 4 shows the order of the processes used for signal conditioning: time normalization, dead time elimination, baseline correction, low- and high-frequency noise elimination, and peak alignment. Each process implements sub-processes (threads) in sequence, represented horizontally. Each process is described next.
Baseline correction. This process corrects errors in the offset of the signal with respect to the zero axis so that the sample intensities are read correctly. To this end, an algorithm is applied that smooths the signal with a weighted moving average filter using (1), identifies the valleys with the second-derivative criterion using (2) to form a vector with these values, interpolates a curve between each pair of consecutive valleys to estimate the baseline between them using a second-order spline approximation by (3), and subtracts the baseline values from the signal to correct it by (4).
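The four baseline correction steps can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the (1, 2, 1) weights and the example values are assumptions, and the baseline between consecutive valleys is interpolated linearly, a simplification of the second-order spline named in the text.

```python
def moving_average(signal, weights=(1, 2, 1)):
    """Weighted moving average; the weights are an assumed example.
    End points are left unchanged."""
    half = len(weights) // 2
    total = sum(weights)
    out = list(signal)
    for i in range(half, len(signal) - half):
        out[i] = sum(w * signal[i - half + k] for k, w in enumerate(weights)) / total
    return out


def find_valleys(signal):
    """Indices where the slope changes from negative to non-negative
    (a positive second difference), plus both end points."""
    valleys = [0]
    for i in range(1, len(signal) - 1):
        left = signal[i] - signal[i - 1]
        right = signal[i + 1] - signal[i]
        if left < 0 <= right:
            valleys.append(i)
    valleys.append(len(signal) - 1)
    return valleys


def estimate_baseline(signal, valleys):
    """Baseline between consecutive valleys.
    Assumption: linear interpolation instead of a second-order spline."""
    baseline = [0.0] * len(signal)
    for a, b in zip(valleys, valleys[1:]):
        for i in range(a, b + 1):
            t = (i - a) / (b - a)
            baseline[i] = signal[a] + t * (signal[b] - signal[a])
    return baseline


def correct_baseline(signal):
    """Smooth, locate valleys, estimate the baseline and subtract it.
    Assumes the signal has at least two samples."""
    smoothed = moving_average(signal)
    valleys = find_valleys(smoothed)
    baseline = estimate_baseline(smoothed, valleys)
    return [y - b for y, b in zip(signal, baseline)]


# Hypothetical chromatogram fragment: one peak sitting on an offset of 5 mV.
corrected = correct_baseline([5, 5, 5, 8, 12, 8, 5, 5, 5])
```

In this example the offset is removed while the peak keeps its height above the baseline, which is the behavior the text reports for Figure 5.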

In (1), $y_i$ is the present value of the signal, $y_{i-1}$ is the past value of the signal, $c_i$ is the weighting factor and $n$ is the maximum number of samples.
In (2), $y''$ is the second derivative of the sample $y_i$, $y_i$ is the present value of the signal, $y_{i+1}$ is the next value of the signal and $y_{i+2}$ is the value that follows $y_{i+1}$.
In (4), $y_i$ is the present value of the signal and $p_i(v)$ is the quadratic polynomial of the peak baseline.
Figure 5 shows the original signal (blue) and the signal with the corrected baseline (orange). The baseline correction process does not alter the chromatogram peaks, evidencing the proper functioning of the algorithm.
Low frequency noise elimination. Data smoothing is used to reveal characteristics and components of the signal that can be hidden by noise, which makes it difficult to calculate parameters such as areas and heights [23]. In this process, the smoothing algorithms are compared by their signal-to-noise ratio.
The first algorithm implemented is the rectangular smooth, or boxcar, an unweighted smoothing algorithm that replaces each point in the signal with the average of m adjacent points, where m is a positive odd integer so that the coefficients are balanced around the central point, which preserves the x-axis position of the peaks and other features in the smoothed signal. This project uses m = 3, resulting in (5).
$$S_j = \frac{y_{j-1} + y_j + y_{j+1}}{3} \qquad (5)$$
In (5), $S_j$ is the smoothed signal, $y_j$ is the present value of the original signal, $y_{j-1}$ is the previous value of the original signal and $y_{j+1}$ is the next value of the original signal.
Another algorithm used is the triangular smooth, which implements a weighted smoothing function. This project uses m = 5, resulting in (6).
$$S_j = \frac{y_{j-2} + 2y_{j-1} + 3y_j + 2y_{j+1} + y_{j+2}}{9} \qquad (6)$$
In both cases, the integer in the denominator is the sum of the coefficients in the numerator, which results in a unit-gain smooth that has no effect on the signal where it is a straight line and that preserves the area under the peaks.
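As an illustration of the two smooths and of the unit-gain property, the following sketch applies a three-point rectangular smooth and a five-point triangular smooth. The function names and the example data are assumptions, and leaving the end points unchanged is one of several possible edge treatments, not necessarily the one used in this work.

```python
def smooth(signal, weights):
    """Sliding weighted average divided by the sum of the weights (unit gain).
    End points without a full window are left unchanged (an assumed edge rule)."""
    half = len(weights) // 2
    total = sum(weights)
    out = list(signal)
    for j in range(half, len(signal) - half):
        window = signal[j - half:j + half + 1]
        out[j] = sum(w * y for w, y in zip(weights, window)) / total
    return out


def rectangular_smooth(signal, passes=1):
    """Three-point rectangular (boxcar) smooth, optionally applied several times."""
    for _ in range(passes):
        signal = smooth(signal, [1, 1, 1])
    return signal


def triangular_smooth(signal):
    """Five-point triangular smooth with weights 1, 2, 3, 2, 1."""
    return smooth(signal, [1, 2, 3, 2, 1])


# A straight line passes through both smooths unchanged.
ramp = [2.0 * i for i in range(10)]
ramp_rect = rectangular_smooth(ramp)
ramp_tri = triangular_smooth(ramp)

# A hypothetical noisy peak: the area is preserved while the jumps are reduced.
peak = [0, 0, 1, 0, 4, 9, 4, 0, 1, 0, 0]
one_pass = rectangular_smooth(peak)
four_passes = rectangular_smooth(peak, passes=4)
```

Repeating the three-point rectangular smooth, as in `rectangular_smooth(peak, passes=4)`, is how multi-pass smooths can be built from the single-pass one.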
The pseudo-Gaussian and the w width smooths are also used, which apply the three-point rectangular smooth three and four times, respectively. Table I shows the signal-to-noise ratio for the smoothing algorithms applied; the best results were obtained with four passes of the three-point rectangular smooth.
Figure 6 shows the result of applying the w width algorithm, which eliminates the signal jumps and contributes to the identification of peaks and valleys for the calculation of the peak areas in the energy analysis. With the application of the filter, the signal-to-noise ratio of the smoothed signal improves by 5.14% for the chromatograms of the control patients and by 6.09% for those of the patients with prostate cancer.
Peak alignment. In the digitized chromatogram signal, misalignment and shifting of the data are common because of calibration errors in the instruments or in the environment where the procedure is performed. For this reason, the icoshift algorithm developed by [24] is implemented. It consists of three parts: definition of intervals, maximization of the cross-correlation of each interval by a Fast Fourier Transform (FFT) engine, and reconstruction of the signal. The algorithm takes a reference signal and aligns the other signals to it according to the peaks marked as representative, repeating the process iteratively until all the peaks are aligned. This procedure is valuable for the construction of the characteristic matrix used in the principal component analysis.
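The idea behind the cross-correlation step can be sketched for a single interval as follows. This is not the icoshift implementation: the function names, the `max_shift` parameter and the zero padding are assumptions, and the cross-correlation is computed directly rather than with an FFT, which gives the same maximum at a higher cost for long signals.

```python
def best_shift(reference, signal, max_shift):
    """Lag, in samples, that maximizes the cross-correlation with the reference.
    Assumes both signals have the same length."""
    n = len(reference)
    best, best_score = 0, float("-inf")
    for lag in range(-max_shift, max_shift + 1):
        score = sum(reference[i] * signal[i - lag]
                    for i in range(n) if 0 <= i - lag < n)
        if score > best_score:
            best, best_score = lag, score
    return best


def apply_shift(signal, lag, fill=0.0):
    """Move the signal by `lag` samples (positive moves it to the right),
    padding the vacated positions with `fill` (an assumed padding rule)."""
    n = len(signal)
    return [signal[i - lag] if 0 <= i - lag < n else fill for i in range(n)]


# Hypothetical interval: the same peak, recorded two samples late.
reference = [0, 0, 1, 5, 1, 0, 0, 0]
delayed = [0, 0, 0, 0, 1, 5, 1, 0]
lag = best_shift(reference, delayed, max_shift=3)
aligned = apply_shift(delayed, lag)
```

Here the detected lag is negative, so the delayed signal is moved to the left until its peak coincides with the reference peak.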

Figure 8. Mathematical chromatogram classification techniques
Energy analysis. This process improves the resolution and symmetry of the peaks, identifies the peaks and valleys, delimits the area of each peak, calculates its energy and builds the characteristic matrix.
The sharpening technique improves the resolution by (7) and the symmetry by (8). It contributes to the precision of the measured areas, helping to identify the peaks and valleys and to delimit the area of each peak with the perpendicular drop method.
In (7), $R_j$ is the signal with the improved resolution, $y_j$ is the original signal, $y''$ is the second derivative of the original signal, $y''''$ is the fourth derivative of the original signal, and $k_1$ and $k_2$ are the weighting factors of the second and fourth derivatives, respectively.
In (8), $S_j$ is the signal with the improved symmetry, $y_j$ is the original signal, $y'$ is the first derivative of the original signal and $k_1$ is a weighting factor.
The energy of each peak is calculated using (9), where the summation limits are the samples between the starting and ending valleys of each peak. This calculation is performed for each chromatogram in the time domain, in the frequency domain by applying the FFT, and by applying the Discrete Cosine Transform (DCT).
In (9), $E_p$ is the energy of each peak, $p = 1, 2, 3, \ldots, n$ indexes the peaks, $y_i^2$ is the squared intensity of the sample, $v_i$ is the valley to the left of each peak and $v_{i+1}$ is the valley to the right of each peak.
As a result, three characteristic matrices are obtained, one for each analysis, formed by the energy of each peak in its corresponding position. When a component is absent, its position in the characteristic vector is filled with zero.
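The time-domain energy calculation can be illustrated with a short sketch. The function names and the example signal are assumptions; valleys are taken as local minima plus the two end points, and the sample at a shared valley is counted in both neighboring peaks because the summation limits include both valleys.

```python
def find_valleys(signal):
    """Indices of local minima, plus both end points, used as peak boundaries."""
    valleys = [0]
    for i in range(1, len(signal) - 1):
        if signal[i] < signal[i - 1] and signal[i] <= signal[i + 1]:
            valleys.append(i)
    valleys.append(len(signal) - 1)
    return valleys


def peak_energies(signal, valleys):
    """Sum of squared intensities between each pair of consecutive valleys,
    with both valleys included in the sum."""
    return [sum(y * y for y in signal[left:right + 1])
            for left, right in zip(valleys, valleys[1:])]


# Hypothetical baseline-corrected chromatogram with two peaks.
signal = [0, 1, 3, 1, 0, 2, 5, 2, 0]
valleys = find_valleys(signal)             # [0, 4, 8]
energies = peak_energies(signal, valleys)  # [11, 33]
```

Each list of energies forms one row of the characteristic matrix; a frequency-domain version would apply the same sum to the transformed samples.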
Principal component analysis. It applies a correlation-based algorithm that uses statistical methods to describe a data set in terms of new uncorrelated variables, called components, obtained as linear combinations of the original variables, with the aim of reducing the dimensionality of the data [25].
Figure 9 shows the principal component analysis in a scree (sedimentation) plot of the energy data calculated in the time domain, with a unique characteristic curve whose classification shows no overlap between patients. Principal components 1 and 2 contain the most information about the characteristic chromatograms, followed by component 3. Figure 10 shows the principal component analysis in a scree (sedimentation) plot of the energy data calculated in the frequency domain, with the FFT and the DCT, which did not show a single characteristic curve and showed overlap between patients in their classification.
Figure 11 shows an alternative analysis of the data calculated in the time domain using the score plot, with principal components 1 and 2 as axes, in which 88.88% of the control patients form a cluster. Patients with prostate cancer appear as isolated values. This analysis shows that control and cancer patients differ in their chromatograms, evidencing the relationship between the signals and prostate cancer.
Figure 11. Score chart of main components
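The two views used above can be related to the underlying computation with a minimal sketch: the fraction of variance explained by each component is what a scree plot displays, and the projections of each characteristic vector onto the first components are the coordinates of a score plot. The code is an illustration under assumptions, not the implementation used in this work: the function names are hypothetical, the eigenvectors are found by power iteration with deflation, and the data must have non-zero variance.

```python
import math


def center(rows):
    """Subtract the column means from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def top_eigenpair(matrix, iterations=500):
    """Dominant eigenvalue and eigenvector by power iteration.
    The fixed starting vector is an assumption; it fails if it happens to be
    orthogonal to the dominant eigenvector."""
    d = len(matrix)
    start = [float(j + 1) for j in range(d)]
    norm = math.sqrt(sum(x * x for x in start))
    vec = [x / norm for x in start]
    for _ in range(iterations):
        nxt = [sum(matrix[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    value = sum(vec[a] * sum(matrix[a][b] * vec[b] for b in range(d))
                for a in range(d))
    return value, vec


def pca(rows, n_components=2):
    """Return (explained variance ratios, scores) for the first components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    total = sum(cov[j][j] for j in range(d))
    components, ratios = [], []
    for _ in range(n_components):
        value, vec = top_eigenpair(cov)
        components.append(vec)
        ratios.append(value / total)
        # Deflation: remove the component just found before the next search.
        cov = [[cov[a][b] - value * vec[a] * vec[b] for b in range(d)]
               for a in range(d)]
    scores = [[sum(r[j] * c[j] for j in range(d)) for c in components]
              for r in centered]
    return ratios, scores


# Hypothetical characteristic matrix: one row per patient, one column per peak energy.
ratios, scores = pca([[1, 2], [2, 4], [3, 6], [4, 8]], n_components=2)
```

In this toy matrix all rows lie on one line, so the first component carries essentially all of the variance; with real characteristic matrices the ratios decay more gradually.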

Conclusions
The sequence of mathematical signal processing techniques applied to the chromatograms improved the signal-to-noise ratio by 37.67% for control patients and by 57.55% for patients with prostate cancer. This improvement contributes to the accuracy of the identification of peaks and valleys and of the energy and principal component analyses.
The scree (sedimentation) plot shows a unique behavior of the principal components corresponding to control patients and prostate cancer patients, validating the energy analysis of the peaks of each signal in the time domain as a differentiating factor.
In the score plot, the structure of the data shows a cluster containing 88.88% of the vectors of the control patients. The remaining 11.11% is considered atypical and points to an error in the inclusion of the chromatogram in the control group, which may have originated in the urine sample.
In the case of prostate cancer patients, the distribution of the data is uniform in three groups of 33.33% of the vectors around the area defined by the control patient vectors. This representation delimits signal classification regions to diagnose possible prostate cancer patients.
The results support applying the extraction of significant peaks as a pattern extraction technique and searching for other characteristics that differentiate the chromatograms of prostate cancer patients and control patients and sharpen their classification.