Tutorials

What are Machine Learning Models?

Machine learning models are algorithms that learn from data to make predictions or find patterns. They analyze past information and apply it to new data to classify objects, predict outcomes, or group similar items.

There are three main types of machine learning models:

  1. Classification – Predicts categories or labels (e.g., spam vs. not spam).
  2. Regression – Predicts continuous numerical values (e.g., house prices).
  3. Clustering – Groups similar data points without predefined labels (e.g., customer segmentation).

These models are widely used in applications like fraud detection, medical diagnosis, and recommendation systems. 🚀

Types of Machine Learning Models

Classification
Definition: Classification models predict categorical outcomes by assigning data to predefined labels or classes.
When to Use:
  • When the outcome is discrete or belongs to specific categories.
  • When you need to determine "which group" a data point falls into.
Real-World Examples:
  • Spam Detection (Spam vs. Not Spam)
  • Credit Approval (Approve/Reject)
  • Disease Diagnosis (Healthy vs. Diseased)
  • Sentiment Analysis (Positive, Neutral, Negative)
Popular Algorithms:
  • Logistic Regression
  • Decision Trees & Random Forest
  • Naive Bayes
  • Support Vector Machines (SVM)
  • k-Nearest Neighbors (kNN)
  • Neural Networks (for multi-class classification)

Regression
Definition: Regression models predict continuous numerical outputs. The target variable has an infinite range of possible values.
When to Use:
  • When the outcome is numerical or continuous.
  • When you need to predict a specific value.
Real-World Examples:
  • House Price Prediction (based on size, location, etc.)
  • Stock Market Forecasting
  • Estimating Insurance Premiums
  • Predicting Weather Metrics
Popular Algorithms:
  • Linear Regression
  • Decision Trees & Random Forest (for regression)
  • Support Vector Regression (SVR)
  • Ridge and Lasso Regression
  • Gradient Boosting (XGBoost, LightGBM)
  • Neural Networks (for time-series or complex regression tasks)

Clustering
Definition: Clustering is an unsupervised learning technique that groups similar data points together. Unlike classification, clustering does not require predefined labels; the model discovers patterns in the data.
When to Use:
  • When you don’t have labeled data.
  • When you need to find hidden patterns or groupings.
Real-World Examples:
  • Customer Segmentation (based on purchasing behavior)
  • Document Clustering (grouping similar documents)
  • Image Segmentation (grouping pixels with similar colors/textures)
Popular Algorithms:
  • K-Means Clustering
  • Hierarchical Clustering
  • DBSCAN (Density-Based Clustering)
  • Gaussian Mixture Models (GMM)
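As one concrete instance of the classification algorithms listed above, here is a minimal k-Nearest Neighbors sketch in plain Python. The toy fruit data, the function name `knn_predict`, and the choice of `k=3` are assumptions made for illustration; they are not taken from the platform.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k nearest neighbours and take a majority vote over their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (weight in grams, diameter in cm) -> fruit type.
points = [(150, 7.0), (160, 7.5), (170, 7.2), (10, 2.0), (12, 2.2), (9, 1.8)]
labels = ["apple", "apple", "apple", "grape", "grape", "grape"]

prediction = knn_predict(points, labels, query=(155, 7.1), k=3)
```

The number of neighbours `k` plays the role of the "number of neighbors" setting: small values follow local noise, larger values smooth the decision.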

    Subcategories of Machine Learning Models

    Model | What it Does | Example | What You Tweak | Key Hyperparameters | Input Data | Output

    Regression Models (Predict Continuous Values)
    Simple Linear Regression | Draws a straight line to predict one value based on another. | Predicting sales based on ad spend. | The slope of the line. | None | Numeric (1 independent, 1 dependent variable) | Numeric (predicted value)
    Multiple Linear Regression | Uses multiple inputs to predict a single value. | Car price prediction (based on age, mileage, etc.). | Choosing the right inputs. | None | Numeric (multiple independent variables) | Numeric (predicted value)
    Decision Tree Regression | Uses decision rules to predict numbers. | House price prediction based on size. | Tree depth, split criteria. | max_depth, min_samples_split, criterion | Numeric or categorical (labeled data) | Numeric (predicted value)
    Random Forest Regression | Uses multiple decision trees for better accuracy. | Predicting monthly electricity usage. | Number of trees, tree depth. | n_estimators, max_depth, criterion | Numeric or categorical (labeled data) | Numeric (predicted value)
    KNN Regression | Averages the values of the closest data points. | Predicting house prices based on nearby houses. | Number of neighbors. | n_neighbors, weights, metric | Numeric or categorical (labeled data) | Numeric (predicted value)
    SVM Regression | Predicts numbers within a margin (similar to classification). | Predicting rental prices with a margin for error. | Same as SVM classification with margin settings. | C, kernel, gamma, epsilon | Numeric (labeled data with target variable) | Numeric (predicted value within a margin)

    Classification Models (Predict Categorical Outputs)
    Logistic Regression | Predicts probability for binary classification (yes/no). | Will a loan be approved? | Feature weights. | penalty, C, solver | Numeric (binary labeled data) | Categorical (0 or 1)
    Decision Tree Classification | Uses decision rules to classify data. | Spam vs. Not Spam | Tree depth, splitting rules. | max_depth, min_samples_split, criterion | Numeric or categorical (labeled data) | Categorical (predicted class)
    Random Forest Classification | Uses multiple trees to improve classification accuracy. | Diagnosing diseases (e.g., flu or not flu). | Number of trees, tree depth. | n_estimators, max_depth, criterion | Numeric or categorical (labeled data) | Categorical (predicted class)
    KNN Classification | Assigns a category based on nearest neighbors. | Identifying fruit type based on attributes. | Number of neighbors. | n_neighbors, weights, metric | Numeric or categorical (labeled data) | Categorical (predicted class)
    SVM Classification | Finds the best boundary to separate categories. | Sorting emails into spam or not spam. | Shape of the boundary, margin. | C, kernel, gamma | Numeric or categorical (labeled data) | Categorical (predicted class)

    Clustering Models (Unsupervised Learning)
    K-Means Clustering | Groups numerical data into clusters based on similarity. | Customer segmentation by spending habits. | Number of clusters. | n_clusters, init, max_iter, tol | Numeric (unlabeled data) | Categorical (cluster assignments)
    K-Modes Clustering | Groups categorical data into clusters based on similarity. | Customer segmentation based on categorical traits like gender, job. | Number of clusters, initialization method. | n_clusters, init, n_init | Categorical | Categorical (cluster assignments)
    K-Prototypes Clustering | Groups mixed data (numerical + categorical) into clusters based on similarity. | Customer segmentation by age, salary, and job type. | Number of clusters, initialization method. | n_clusters, init, random_state | Mixed (numerical + categorical data) | Categorical (cluster assignments)

    Text Processing
    Text Classification (Natural Language Processing or NLP) | Groups text into categories. | Grouping product reviews as positive, negative, or neutral. | Preprocessing, vectorization, model type. | vectorizer, model_type, max_features | Raw or preprocessed text | Predicted class labels (e.g., "positive")
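To show what the clustering hyperparameters above control, the following is a minimal K-Means sketch (Lloyd's algorithm) in plain Python. It assumes the initial centroids are passed in explicitly so the run is deterministic; the number of centroids corresponds to `n_clusters` and `max_iter` caps the number of passes. The function name `kmeans` and the customer data are hypothetical.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return assignments, centroids


# Hypothetical customer data: (annual spend, visits per month), arbitrary units.
customers = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
assignments, centroids = kmeans(customers, centroids=[customers[0], customers[3]])
```

The output is one cluster index per input row, which matches the "cluster assignments" output described for the clustering models.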

    Choosing Between Classification, Regression, and Clustering

    To determine the right model, consider the following:

    1. What type of target variable do you have?
      • Categorical (labels): Use classification (e.g., spam vs. not spam).
      • Continuous (numerical values): Use regression (e.g., predicting house prices).
      • No predefined target variable: Use clustering to discover patterns and create labels.
    2. What question are you trying to answer?
      • Predicting a category? → Classification
      • Predicting a numerical value? → Regression
      • Grouping similar data points? → Clustering
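The decision rules above can be sketched as a small helper. This is only a heuristic illustration: the function name `suggest_model_family` and the `max_classes` threshold are assumptions, and a numeric target with few distinct values may still call for regression in practice.

```python
def suggest_model_family(target_values, max_classes=10):
    """Suggest a model family from the target column (None means no target)."""
    if target_values is None:
        return "clustering"  # no predefined target variable
    values = list(target_values)
    # Numeric targets with many distinct values are treated as continuous.
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if is_numeric and len(set(values)) > max_classes:
        return "regression"
    return "classification"


# Hypothetical target columns.
house_prices = [101.5 + 3.25 * i for i in range(12)]
email_labels = ["spam", "not spam", "spam"]
```

Calling `suggest_model_family(house_prices)` falls into the regression branch, while `email_labels` falls into the classification branch.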

    How to use Data Prep

    Step 1: Upload a Data File


    Click the "Upload CSV or Excel file" input field.
    Select a CSV or Excel file from your device.
    Click the "Upload" button to send the file for processing.

    Step 2: View Uploaded Data

    After uploading, you will see:
    • Data Information: Metadata about the dataset (e.g., column names, data types).
    • Data Preview - Head: Displays the first few rows of the dataset.
    • Data Preview - Tail: Displays the last few rows of the dataset.
    • Null Values: A list of columns with null values and their counts.
    • Unique Value Counts: A list of columns and the number of unique values they contain.
    • Duplicate Rows: The count of duplicate rows in the dataset.
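The checks listed above can be approximated with the standard library. This sketch assumes an empty cell counts as a null value and uses a small inline CSV in place of an uploaded file; how the platform itself computes these figures is not documented here.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for an uploaded CSV file.
SAMPLE_CSV = """name,city,age
Ana,Lisbon,34
Ben,,29
Ana,Lisbon,34
Cy,Oslo,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
columns = list(rows[0].keys())

# Null values: an empty cell is treated as null (an assumption of this sketch).
null_counts = {c: sum(1 for r in rows if r[c] == "") for c in columns}

# Unique value counts per column, ignoring empty cells.
unique_counts = {c: len({r[c] for r in rows if r[c] != ""}) for c in columns}

# Duplicate rows: every repeat of an already-seen row counts once.
row_counts = Counter(tuple(r[c] for c in columns) for r in rows)
duplicate_rows = sum(n - 1 for n in row_counts.values())

# Head and tail previews (first and last two rows here).
head, tail = rows[:2], rows[-2:]
```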

    Step 3: Perform Operations on the Data

    Use the dropdown menu to choose an operation (e.g., remove duplicates, change data type, handle null values).
    Depending on your selection, additional input fields will appear. Fill in the required details:
    • Add Column: Specify the column name and either a static value or a derived expression (e.g., col1 + col2).
    • Handle Null Values: Choose a strategy (drop rows or fill null values) and provide a fill value if needed.
    • Change Data Type: Specify the column name and the new data type (e.g., int, float, or str).
    • Sort Column: Specify the column and sort order (ascending or descending).
    • Find and Replace: Specify the column, the value to find, and the value to replace it with.

    Click the Apply Operation button to execute the selected operation.
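As a rough picture of what these operations do to the data, here is a sketch that represents the dataset as a list of dictionaries with `None` for null values. The helper names (`handle_nulls`, `change_type`, `find_replace`, `add_column`) and the sample rows are hypothetical and only mirror the options described above.

```python
def handle_nulls(rows, column, strategy="drop", fill_value=None):
    """Drop rows whose column is None, or replace None with fill_value."""
    if strategy == "drop":
        return [r for r in rows if r[column] is not None]
    if strategy == "fill":
        return [
            {**r, column: fill_value if r[column] is None else r[column]}
            for r in rows
        ]
    raise ValueError(f"unknown strategy: {strategy}")


def change_type(rows, column, new_type):
    """Cast every non-null value in column with new_type (int, float, str)."""
    return [
        {**r, column: None if r[column] is None else new_type(r[column])}
        for r in rows
    ]


def find_replace(rows, column, find, replace):
    """Replace exact matches of find in one column."""
    return [{**r, column: replace if r[column] == find else r[column]} for r in rows]


def add_column(rows, name, derive):
    """Add a column whose value is derived from each row."""
    return [{**r, name: derive(r)} for r in rows]


rows = [
    {"item": "pen", "price": "2", "qty": "10"},
    {"item": "ink", "price": None, "qty": "3"},
    {"item": "pad", "price": "5", "qty": "4"},
]
filled = handle_nulls(rows, "price", strategy="fill", fill_value="0")
typed = change_type(change_type(filled, "price", float), "qty", int)
with_total = add_column(typed, "total", lambda r: r["price"] * r["qty"])
# Sort Column: descending by the derived total.
by_total = sorted(with_total, key=lambda r: r["total"], reverse=True)
```

Each helper returns a new list rather than modifying the input, so the original rows remain available if an operation needs to be undone.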

    To Concatenate or Merge Files:

    Use the Concatenate or Merge Files section to combine additional data files with the initially uploaded data file.

    Upload the files using the Choose File button. You may choose multiple files.
    Choose an operation:
    • Concatenate: Combine the files vertically (rows are appended).
    • Merge: Merge the files based on a specific column. You’ll need to:
      • Provide the column name to merge on.
      • Select a merge type (inner, outer, left, or right).
    Click the Upload and Process button to process the files.
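The difference between concatenating and the four merge types can be seen in a small sketch over lists of dictionaries. The functions `concatenate` and `merge` are hypothetical stand-ins, not the platform's implementation; this version assumes a single key column and lets the right table win if both tables share another column name.

```python
def concatenate(*tables):
    """Stack tables (lists of dict rows) vertically."""
    return [dict(row) for table in tables for row in table]


def merge(left, right, on, how="inner"):
    """Join two tables on one key column; how is inner, outer, left, or right."""
    if how not in {"inner", "outer", "left", "right"}:
        raise ValueError(f"unknown merge type: {how}")
    left_cols = [c for c in (left[0] if left else {}) if c != on]
    right_cols = [c for c in (right[0] if right else {}) if c != on]
    # Index the right table by key so matches can be looked up directly.
    right_index = {}
    for row in right:
        right_index.setdefault(row[on], []).append(row)
    merged, matched_keys = [], set()
    for lrow in left:
        matches = right_index.get(lrow[on], [])
        if matches:
            matched_keys.add(lrow[on])
            merged.extend({**lrow, **rrow} for rrow in matches)
        elif how in {"left", "outer"}:
            # Keep the unmatched left row, padding right columns with None.
            merged.append({**lrow, **{c: None for c in right_cols}})
    if how in {"right", "outer"}:
        for rrow in right:
            if rrow[on] not in matched_keys:
                # Keep the unmatched right row, padding left columns with None.
                merged.append({on: rrow[on], **{c: None for c in left_cols}, **rrow})
    return merged


# Hypothetical tables sharing the key column "id".
customers = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Ben"}]
orders = [{"id": 2, "total": 30}, {"id": 3, "total": 15}]

stacked = concatenate(customers, customers)
inner = merge(customers, orders, on="id", how="inner")
outer = merge(customers, orders, on="id", how="outer")
```

An inner merge keeps only keys present in both files, left and right keep every row of one side, and outer keeps every row of both.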

    Step 4: Export Processed Data

    If you've processed the data and want to download it:
    In the Export Data section, choose whether to export the data (Yes/No).
    Select the export format (CSV or Excel).
    Click the Download button to save the processed file to your computer.

    How to use Data Explore

    Step 1: Upload a Data File

    Click the "Upload CSV or Excel file" input field.
    Select a CSV or Excel file from your device.
    Click the "Upload" button to send the file for processing.

    Step 2: View Data Analysis Results

    After submitting, the page will refresh and display various insights about your data:

    Data Information:

    • Number of rows and columns.
    • Dataset structure (e.g., data types, null values, etc.)

    Dataset Preview:

    • A preview of the first 5 and last 5 rows of the dataset.

    Summary Statistics:

    • Statistical summaries like mean, median, etc., for numerical columns.

    Null Values:

    • Columns with missing values and their respective counts.

    Unique Values:

    • Count of unique values in each column.

    Duplicates:

    • Number of duplicate rows in the dataset.

    Skewness and Kurtosis:

    • Measures of data distribution for numerical columns.
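For reference, skewness and kurtosis can be computed from standardized moments. The sketch below uses the population formulas (third moment for skewness, fourth moment minus 3 for excess kurtosis). Which estimator the platform uses is not stated, and some libraries apply bias corrections, so displayed values may differ slightly.

```python
import statistics


def skewness(values):
    """Population skewness: mean of cubed standardized deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


def excess_kurtosis(values):
    """Population excess kurtosis: fourth standardized moment minus 3."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 4 for v in values) / len(values) - 3.0


# Hypothetical numeric columns.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 10]

summary = {
    "mean": statistics.fmean(symmetric),
    "median": statistics.median(symmetric),
    "skewness": skewness(symmetric),
    "kurtosis": excess_kurtosis(symmetric),
}
```

A skewness near zero suggests a roughly symmetric column; a positive value indicates a longer right tail, as in `right_skewed`.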

    Step 3: Generate Graphs

    Select Graph Type:
    Options include scatter plot, line graph, bar graph, pair plot, box plot, and histogram.

    Choose X and Y Columns:
    Use the dropdowns to select the column for the X-axis and, where applicable, the Y-axis.
    The Y-axis is optional for certain graph types, such as histograms.

    Generate the Graph:
    Click "Generate Graph" to produce the visualization.

    If successful, the graph will appear below this section.

    Build an AI Model

    Choose the AI model that you'd like to build. You may choose from the various Classification, Regression, and Clustering models available on the platform.
    This example uses the Simple Linear Regression model to explain the UI and functionality of the platform. All models follow the same steps, with the exception of hyperparameter tuning, which is different for each AI model. Please see the AI Models Info page for more information on hyperparameters. If you are not familiar with them, do not worry: the platform is designed to work even if you don't enter any hyperparameters, and you can use the defaults as they are to build your AI model.

    Simple Linear Regression

    Step 1: Upload a Data File

    Select a file (Excel or CSV format) using the "Select File" input.
    Enter the number of rows to display in the "Number of Rows to Display" field (default is 5).
    Click the "Upload" button.

    This will send the file to the server for processing.
    After uploading, you will see the data's information (e.g., structure, column names) and a sample preview.

    Step 2: Train the Model

    Input Independent Variable Columns:
    Specify which columns from your dataset will be used as input features.
    Example: Enter 0,1,2 for the 1st, 2nd, and 3rd columns.

    Important Note: Column numbers always start at 0. Refer to the column numbers in the Data Info section for clarity.

    Output Variable Column (Y): Specify the column representing the target variable. Example: Enter 3 for the 4th column.
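The 0-based column numbering can be illustrated with a short sketch that turns the entered text into indexes and separates inputs from the target. The helper names `parse_column_indexes` and `split_features_target` are hypothetical, and rows are assumed to be plain lists of values.

```python
def parse_column_indexes(text):
    """Turn a string such as "0,1,2" into a list of 0-based column indexes."""
    return [int(part) for part in text.split(",") if part.strip()]


def split_features_target(rows, feature_text, target_index):
    """Return (X, y): the selected input columns and the target column."""
    feature_indexes = parse_column_indexes(feature_text)
    X = [[row[i] for i in feature_indexes] for row in rows]
    y = [row[target_index] for row in rows]
    return X, y


# Hypothetical dataset with four columns; the 4th column (index 3) is the target.
data = [
    [1.0, 2.0, 3.0, 10.0],
    [4.0, 5.0, 6.0, 20.0],
]
X, y = split_features_target(data, "0,1,2", target_index=3)
```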

    Scaling and Encoding (Optional): Choose a scaling or encoding method from the dropdown.
    Options include "Standard Scalar," "MinMax Scalar," or encoding techniques like "OneHot Encoder".
    If encoding is selected, specify the column to be encoded in the "Enter Column for Encoding" field.
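What these options typically do to a column can be sketched in plain Python. The formulas are the conventional ones (zero mean and unit variance for standard scaling, a 0 to 1 range for min-max scaling, one indicator column per category for one-hot encoding); treating them as what the platform applies is an assumption, and the function names are hypothetical.

```python
import statistics


def standard_scale(values):
    """Rescale to mean 0 and standard deviation 1 (population std)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # a constant column (std 0) would fail here
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)  # requires hi > lo
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(values):
    """Return the sorted categories and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


# Hypothetical columns.
scaled = min_max_scale([10, 20, 30])
standardized = standard_scale([1, 2, 3])
categories, encoded = one_hot_encode(["red", "blue", "red"])
```

Scaling matters most for distance-based models such as KNN and SVM, where a column with large values would otherwise dominate the distance.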

    Test Size: Define the proportion of data to be used for testing (e.g., 0.25 for 25%).
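A test size of 0.25 means a quarter of the rows are held out for evaluation. A minimal sketch of such a split follows; the function name, the shuffling, and the `seed` parameter are assumptions for illustration rather than documented platform behavior.

```python
import math
import random


def train_test_split(rows, test_size=0.25, seed=0):
    """Randomly hold out a test_size fraction of rows for evaluation."""
    indexes = list(range(len(rows)))
    random.Random(seed).shuffle(indexes)  # seeded so the split is repeatable
    n_test = math.ceil(len(rows) * test_size)
    test_indexes = set(indexes[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_indexes]
    test = [r for i, r in enumerate(rows) if i in test_indexes]
    return train, test


# Hypothetical dataset of 20 rows.
dataset = list(range(20))
train_rows, test_rows = train_test_split(dataset, test_size=0.25)
```

With 20 rows and a test size of 0.25, 15 rows are used for training and 5 for testing.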

    n_jobs (Optional): Specify the number of parallel jobs to use during model training (default is 0 for no parallelism).

    Evaluation Method:
    Select how you want the model to be evaluated:
    Options include RMSE, R-squared, Coefficients, Intercept, etc.

    Click the "Train Model" button.

    This triggers the server to train the model using the specified parameters.
    If errors occur, they will be displayed in the "Model Evaluation" section.
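To make the evaluation options concrete, here is a sketch of what training a simple linear regression and computing the coefficient, intercept, RMSE, and R-squared involves. It uses the closed-form least-squares solution on hypothetical data; the platform's internal implementation may differ.

```python
import math


def fit_simple_linear_regression(x, y):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    return [slope * xi + intercept for xi in x]


def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Share of the target's variance explained by the model."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical data: ad spend (x) and sales (y).
ad_spend = [1, 2, 3, 4]
sales = [2, 4, 5, 8]

slope, intercept = fit_simple_linear_regression(ad_spend, sales)
fitted = predict(ad_spend, slope, intercept)
error = rmse(sales, fitted)
score = r_squared(sales, fitted)
new_prediction = predict([5], slope, intercept)  # prediction for unseen data
```

Here the metrics are computed on the same rows used for fitting to keep the example short; in practice they would be computed on the held-out test rows.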

    Step 3: Make Predictions

    Ensure the model is trained successfully.
    If successful, the "Predict with Real Data" section will become visible.
    This means that the model has been trained and is ready for use with new (unseen) data.

    Upload a file containing the new data for which you want the trained model to predict the output:

    Select a file (Excel or CSV format) containing the data for which you want predictions.
    Enter the input variable columns for prediction (e.g., 1,2,3).

    Note: If you selected an encoding during training, it is applied automatically to the same input columns at prediction time.

    Click the "Make Prediction" button.

    Predictions will be generated and displayed for the first 5 rows.
    To see the full data, proceed to export the predictions.

    Step 4: Export Predictions

    Choose whether to export predictions:
    Select "Yes" from the "Export Choice" dropdown if you want to save the predictions.
    Select Export Format:
    Choose "CSV" or "Excel" as the file format.
    Click the "Download" button.

    This will download the prediction results in the chosen format.