
Classification (Data Science), k-Nearest Neighbor Classifier (KNN)

Chan Lee 2024. 11. 29. 13:46

There are two types of predictions in data science.

Regression is used to predict numerical data,

while classification is used to predict categorical data.

 

์˜ˆ๋ฅผ ๋“ค์–ด, ์šฐ๋ฆฌ๊ฐ€ ์‚ฌ์šฉํ•˜๋Š” ์ด๋ฉ”์ผ์˜ ์ŠคํŒธ ๋ฉ”์ผํ•จ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

๋ฉ”์ผ์˜ ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ŠคํŒธ์ธ์ง€ ์•„๋‹Œ์ง€, Yes or No ์— ํ•ด๋‹นํ•˜๋Š” Cateogorical variable์„ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค. 

Input = Text / Output = Yes or No (Spam, Not Spam)

 

Classification์— ๋Œ€ํ•ด์„œ ๋” ์ž์„ธํžˆ ์•Œ์•„๋ณด๊ธฐ ์ด์ „, 

๊ฐ„๋‹จํ•˜๊ฒŒ Machine Learning์— ๋Œ€ํ•ด์„œ ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

 

 

Machine Learning Algorithm

  • A mathematical model
  • calculated based on sample data
    • called "training data"
  • that makes predictions or decisions without being explicitly programmed to perform the task.

 

์ฆ‰, machine learning algorithm์€ 

ํŠธ๋ ˆ์ด๋‹ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ๊ณ„์‚ฐ๋œ ์ˆ˜ํ•™์  ๋ชจ๋ธ๋กœ์จ, ๊ฒฐ์ •์ด๋‚˜ ์˜ˆ์ธก์„ ํ•จ์— ์žˆ์–ด์„œ ํ•„์š”ํ•œ ๊ณผ์ •์„ ๋ชจ๋ธ ์ œ์ž‘์ž๊ฐ€ ๋ช…์‹œ์ ์œผ๋กœ ์ž…๋ ฅํ•˜์ง€ ์•Š์€ ํŠน์ง•์ด ์žˆ์Šต๋‹ˆ๋‹ค.

์šฐ๋ฆฌ๊ฐ€ ์‚ฌ์šฉํ•˜๋Š” ๋Œ€๋ถ€๋ถ„์˜ ํ”„๋กœ๊ทธ๋žจ ์ฒ˜๋Ÿผ ๊ฐœ๋ฐœ์ž๊ฐ€ ์ผ์ผํžˆ ๋ชจ๋“  ๊ธฐ๋Šฅ๊ณผ ๊ณผ์ •์„ ๋””์ž์ธ ํ•˜์ง€ ์•Š์•˜๋‹ค๋Š” ๋œป์ž…๋‹ˆ๋‹ค. 

 

Technically speaking, simple linear regression is also a machine learning algorithm.

A classifier, the model used for classification, is a machine learning algorithm as well.

 

 


Classifier


์—ฌ๊ธฐ์„œ Label ์ด๋ž€, classified category๋ผ๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค. (ex. Spam, Not Spam) 

 

์šฐ๋ฆฌ๋Š” ๋ชจ์ง‘๋‹จ (population)์—์„œ sample data๋ฅผ ์ˆ˜์ง‘ํ•œ ๋’ค, ์ด ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ด์„œ classification์„ ํ•˜๋ ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค. 

์ด ๋•Œ, ์šฐ๋ฆฌ๋Š” sample์„ Training set๊ณผ Test set ์œผ๋กœ ๋‚˜๋ˆˆ ๋’ค, training set ๋งŒ์„ ์ด์šฉํ•ด์„œ ๋ชจ๋ธ์„ ํŠธ๋ ˆ์ด๋‹ํ•ฉ๋‹ˆ๋‹ค.

 

์ด๋Š” overfitting (๊ณผ์ ํ•ฉ) ๋ฌธ์ œ๋ฅผ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด์„œ์ž…๋‹ˆ๋‹ค. 

๋ชจ๋“  ๋ชจ๋ธ์„ ํŠธ๋ ˆ์ด๋‹์„ ํ•˜๋ฉด ํ•  ์ˆ˜๋ก ํ•ด๋‹น ํŠธ๋ ˆ์ด๋‹ ๋ฐ์ดํ„ฐ์— ํŠนํ™”๋œ ๋ชจ๋ธ์ด ๊ตฌํ˜„๋˜๊ธฐ ๋งˆ๋ จ์ž…๋‹ˆ๋‹ค. 

๋‹ค๋ฅด๊ฒŒ ๋ณด์ž๋ฉด, bias towards the training set์€ ๋ถˆ๊ฐ€ํ”ผํ•ฉ๋‹ˆ๋‹ค. 

 

ํ•˜์ง€๋งŒ, ๋งŒ์•ฝ training set ๊ทธ ์ž์ฒด๊ฐ€ population์„ ์ถฉ๋ถ„ํžˆ ๋Œ€ํ‘œํ•˜์ง€ ๋ชปํ•œ๋‹ค๋ฉด? 

population์— ์ ์šฉํ•˜๊ณ ์ž ํ•˜๋Š” ์ตœ์ข… ๋ชฉํ‘œ๊ฐ€ ๋ฌด์ƒ‰ํ•˜๊ฒŒ, ์ฃผ์–ด์ง„ ํŠธ๋ ˆ์ด๋‹ ๋ฐ์ดํ„ฐ์—๋งŒ ์™„๋ฒฝํ•œ ๋ฌด์˜๋ฏธํ•œ ๋ชจ๋ธ์ด ๋งŒ๋“ค์–ด์ง‘๋‹ˆ๋‹ค.

์ด๋ฅผ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ์šฐ๋ฆฌ๋Š” sample data๋ฅผ ๋‘˜๋กœ ๋‚˜๋ˆ ์„œ, ํŠธ๋ ˆ์ด๋‹ ์ดํ›„ ๋ชจ๋ธ์„ ํ‰๊ฐ€ํ•˜๋Š” ๊ณผ์ •์—์„œ test set์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. 

 

์ถ”๊ฐ€์ ์œผ๋กœ, sample data์˜ ๋‹ค์–‘์„ฑ ๋˜ํ•œ ๋งค์šฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. 

์˜ˆ๋ฅผ ๋“ค์–ด ์‚ฌ์ง„์„ ํ†ตํ•ด ๊ฐ•์•„์ง€์™€ ๋Š‘๋Œ€๋ฅผ ๊ตฌ๋ถ„ํ•˜๋Š” classifier๋ฅผ ๊ตฌํ˜„ํ•˜์˜€๋Š”๋ฐ, 

ํ•™์Šต ๋ฐ์ดํ„ฐ์—์„œ ๋ชจ๋“  ๋Š‘๋Œ€ ์‚ฌ์ง„์˜ ๋ฐฐ๊ฒฝ์—๋Š” ๋ˆˆ์ด ์žˆ๋Š” ๊ฒจ์šธ์˜ ์ด๋ฏธ์ง€๋ผ๊ณ  ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

๊ทธ๋ ‡๋‹ค๋ฉด ํ•ด๋‹น ๋ชจ๋ธ์€ ๋ฐฐ๊ฒฝ์— ๋ˆˆ์ด ์žˆ๋Š” ๊ฒจ์šธ๋‚ ์˜ ๊ฐ•์•„์ง€์˜ ์ด๋ฏธ์ง€๋ฅผ ๋งค์šฐ ๋†’์€ ํ™•๋ฅ ๋กœ ๋Š‘๋Œ€๋ผ๊ณ  ์˜ˆ์ธกํ•  ๊ฒƒ ์ž…๋‹ˆ๋‹ค. 

์ด๋Ÿฌํ•œ ๋ฌธ์ œ ์—ญ์‹œ ๋ฐ์ดํ„ฐ์˜ ๋‹ค์–‘์„ฑ์„ ํ†ตํ•ด ๋ฐฉ์ง€ํ•  ์ˆ˜ ์žˆ๊ฒ ์Šต๋‹ˆ๋‹ค.

 


Nearest Neighbor Classifier & k-Nearest Neighbor Classifier (KNN)

๊ฐ€์žฅ ๊ธฐ์ดˆ์ ์ธ Classifier์˜ ์ข…๋ฅ˜๋กœ Nearest Neighbor Classifier ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. 

 

์ด๋Š” ์ฃผ์–ด์ง„ data point์™€ ๋‹ค๋ฅธ ๋ชจ๋“  data point ์ค‘์—์„œ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด ๋ฐ์ดํ„ฐ์˜ label์„ ๊ฐ€์ ธ์˜ค๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. 

์—ฌ๊ธฐ์„œ ๊ฐ€๊น๋‹ค์˜ ์ •์˜๋Š”, ํ”ผํƒ€๊ณ ๋ผ์Šค ์ •๋ฆฌ์—์„œ distance๋ฅผ ๊ตฌํ•˜๋Š” ๋ฐฉ๋ฒ•์„ n์ฐจ์›์œผ๋กœ ํ™•์žฅํ•œ ํ˜•์‹์ž…๋‹ˆ๋‹ค.

(n์ฐจ์›์—์„œ๋„ ํ”ผํƒ€๊ณ ๋ผ์Šค ์ •๋ฆฌ๊ฐ€ ์„ฑ๋ฆฝํ•œ๋‹ค์˜ ์ฆ๋ช…์€ ๋‹ค๋ฃจ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.)

d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
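This Euclidean distance is a one-liner in code; a quick sketch:

```python
import math

def distance(p, q):
    """Euclidean distance between two n-dimensional points:
    the Pythagorean theorem generalized to n dimensions."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(distance((0, 0), (3, 4)))        # 5.0  (the classic 3-4-5 triangle)
print(distance((1, 2, 3), (4, 6, 3)))  # 5.0  (same triangle, embedded in 3D)
```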

 

ํ•˜์ง€๋งŒ, ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ์•„๋ฌด๋ž˜๋„ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด ํ•œ๊ฐœ์˜ ์  ๋งŒ์„ ํ™•์ธํ•˜๋‹ˆ, ์ •ํ™•๋„๊ฐ€ ์กฐ๊ธˆ ๋–จ์–ด์ง€๊ฒ ์ฃ ? 

๊ทธ๋ž˜์„œ ์ผ๋ฐ˜์ ์œผ๋กœ ์„ ํ˜ธ๋˜๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ k-Nearest Neighbor Classifier ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

 

์ด๋Š” ๊ฑฐ์˜ ๋™์ผํ•˜์ง€๋งŒ, ๊ฐœ์˜ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด ์ ๋“ค์„ ํ™•์ธํ•˜์—ฌ ๊ทธ ์ค‘ ๊ณผ๋ฐ˜์ˆ˜์ธ label์„ ์ฐจ์šฉํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. 

(์ง๊ด€์ ์œผ๋กœ, k๋ฅผ ์ง์ˆ˜๋กœ ์ง€์ •ํ•˜๋ฉด label์ด ์ ˆ๋ฐ˜์œผ๋กœ ๋‚˜๋‰  ์ˆ˜ ์žˆ๊ธฐ์— ํ™€์ˆ˜๋ฅผ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค.) 

 

 

* Standardize If Necessary

์šฐ๋ฆฌ์˜ ๋ฐ์ดํ„ฐ์—์„œ ๋ชจ๋“  column (variable)์€ ๊ฐ™์€ data range๋ฅผ ๊ฐ€์ง€์ง€ ์•Š์Šต๋‹ˆ๋‹ค. 

 

Suppose in the data, one variable is the age and another variable is the annual income.

Say we are building a model that predicts a person's political orientation based on several variables such as age, annual income, and height.

 

๋‚˜์ด๋Š” ๊ทน๋‹จ์ ์œผ๋กœ ์ฐจ์ด๋‚˜ ๋ด์•ผ 100์„ธ์ด์ง€๋งŒ, ์—ฐ์ˆ˜์ž…์˜ ๋ฒ”์œ„๋Š” ๋งค์šฐ ๊ด‘๋ฒ”์œ„ํ•˜์—ฌ ์ˆ˜์‹ญ์–ต์—์„œ ๊ทธ ์ด์ƒ์˜ ๋ฒ”์œ„๋„ ์กด์žฌํ•  ๊ฒƒ ์ž…๋‹ˆ๋‹ค. 

์ด๋Ÿฐ ๊ฒฝ์šฐ, ๋ชจ๋“  ๋ฐ์ดํ„ฐ๋ฅผ standardize (z-score๋กœ ๋ณ€ํ™˜)ํ•˜๋Š” ๊ฒƒ์ด ๋ฐ”๋žŒ์งํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.