DepartmentOfStatisticsPUE/cda-2026

Categorical Data Analysis 2025/26

Outline of lecture

  1. Categorical data - notebook
  2. Simpson's paradox - notebook
  3. Contingency tables - notebook
  4. Discrete distributions - notebook
  5. Maximum likelihood estimation - notebook
  6. Goodness of fit - notebook
  7. Linear regression - notebook
  8. Marginal effects - notebook
  9. GLM: Count data - notebook
  10. GLM: Logistic regression - notebook

Extra materials

Case study

[TBA]

Code files for download

#   Topic                          R    Python  Julia  Jupyter
1   Categorical data               .R   .py     .jl    .ipynb
2   Simpson's paradox              .R   --      --     --
3   Contingency tables             .R   .py     .jl    .ipynb
4   Discrete distributions         .R   .py     .jl    .ipynb
5   Maximum likelihood estimation  .R   .py     .jl    .ipynb
6   Optimization methods           --   --      --     .ipynb
7   Goodness of fit                .R   .py     .jl    .ipynb
8   Linear regression              .R   .py     .jl    .ipynb
9   Interactions and scaling       --   --      --     .ipynb
10  Marginal effects               .R   .py     .jl    .ipynb
11  GLM: Count data                .R   .py     .jl    .ipynb
12  GLM: Logistic regression       --   --      --     .ipynb

Problem sets

Submit solutions as a single HTML file via Moodle.

#  Topic                                                             HTML  QMD  Jupyter  Deadline
1  Vacancy analysis (categorical data, distributions, MLE)           html  qmd  ipynb    2026-03-31 23:59
2  Retail store analysis (GOF, linear regression, marginal effects)  html  qmd  --       TBA
3  TBA                                                               --    --   --       TBA

Example final test

Example test -- qmd, html

Required packages / modules

  • R:
    • distributions3
    • maxLik, rootSolve
    • vcd, fitdistrplus
    • marginaleffects, modelsummary
    • car
    • see, performance, patchwork
    • geepack
  • Python:
    • scipy, numpy, pandas
    • pingouin, matplotlib, statsmodels
  • Julia:
    • Distributions.jl, DataFrames.jl
    • Optim.jl, Roots.jl
    • HypothesisTests.jl, StatsBase.jl
    • FreqTables.jl, CSV.jl
    • Effects.jl
    • GLM.jl
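
Before opening the Python notebooks, it can help to confirm that the Python modules listed above are installed. The following is a minimal sketch, not part of the course materials; the helper name `missing_modules` is hypothetical, and the check only looks a module up without importing it.

```python
from importlib.util import find_spec

# Python modules named in the list above.
REQUIRED = ["scipy", "numpy", "pandas", "pingouin", "matplotlib", "statsmodels"]


def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required Python modules were found.")
```

Any module reported as missing can then be installed with the package manager you normally use (for example pip or conda).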

Description of the data

Source:

  • id -- company identifier
  • woj -- region (województwo) id (02, 04, ..., 32)
  • public -- whether the company is public (1) or private (0)
  • size -- size of the company (Small = up to 9 employees, Medium = 10 to 49, Large = over 49)
  • nace -- NACE (PKD) section (1 letter)
  • nace_division -- NACE (PKD) division (2 digits, https://www.biznes.gov.pl/pl/klasyfikacja-pkd)
  • vacancies -- number of vacancies reported by the company

Sample rows from the dataset

           id woj public   size nace nace_division vacancies
    1:  27350  14      1  Large    O            84         2
    2:  26705  14      1  Large    O            84         1
    3: 257456  24      1  Large    O            84         2
    4: 183657  16      1 Medium    O            84         0
    5: 200042  18      1 Medium    O            84         0
   ---                                                      
57476: 244800  08      1 Medium    P            85         0
57477:  62309  08      1 Medium    R            93         0
57478: 106708  08      0 Medium    B            08         0
57479:  62264  08      0 Medium    B            08         0
57480: 255865  08      0  Small    C            23         0
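
The codes in woj and nace_division can start with a zero (for example 08), so they are safer to read as text than as numbers. The sketch below illustrates this with pandas, using only the ten rows shown above rewritten as CSV; the CSV layout and the variable names are assumptions made for illustration, not the format of the course data file.

```python
import io

import pandas as pd

# The ten sample rows shown above, rewritten as CSV for illustration only.
SAMPLE_CSV = """id,woj,public,size,nace,nace_division,vacancies
27350,14,1,Large,O,84,2
26705,14,1,Large,O,84,1
257456,24,1,Large,O,84,2
183657,16,1,Medium,O,84,0
200042,18,1,Medium,O,84,0
244800,08,1,Medium,P,85,0
62309,08,1,Medium,R,93,0
106708,08,0,Medium,B,08,0
62264,08,0,Medium,B,08,0
255865,08,0,Small,C,23,0
"""

# Read the code columns as strings so that leading zeros ("08") survive.
df = pd.read_csv(
    io.StringIO(SAMPLE_CSV),
    dtype={"woj": str, "nace_division": str},
)

# Treat company size as an ordered categorical variable.
df["size"] = pd.Categorical(
    df["size"], categories=["Small", "Medium", "Large"], ordered=True
)

# Contingency table: company size by public (1) / private (0).
table = pd.crosstab(df["size"], df["public"])
print(table)
```

With the full dataset the same calls would apply, with the file path passed to `pd.read_csv` in place of the in-memory string.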

Software versions

R version 4.4.2 (2024-10-31)
Python 3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 08:22:19) [Clang 14.0.6 ]
Julia Version 1.11.3
Commit d63adeda50d (2025-01-21 19:42 UTC)
