[kaggle] titanic EDA To prediction

2021. 5. 15. 11:22

https://www.kaggle.com/ash316/eda-to-prediction-dietanic/notebook

EDA To Prediction(DieTanic)

-> Import the required libraries.

-> Load the data.

Check shape and isnull() to see the size of the dataset and the number of missing values per column.
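
A minimal sketch of the loading and checking step. The inline CSV is a made-up stand-in for the Kaggle train.csv (only a few of its columns, invented rows), so the block runs without the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv: a few invented rows in the Titanic layout.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Embarked
1,0,3,male,22,S
2,1,1,female,38,C
3,1,3,female,,S
4,1,1,female,35,
5,0,3,male,,Q
"""
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)                    # (rows, columns)
null_counts = data.isnull().sum()    # number of missing values per column
print(null_counts)
```

With the real file, the same two calls apply after pd.read_csv on its path.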

-> Start the feature analysis.

-> Check the distribution of the target with a matplotlib pie chart and a seaborn countplot.

-> Next, look at each feature one by one to get some insight into why the target turned out this way.

part1: feature analysis

-> Sex, a categorical feature, was analyzed first.

First checked the raw numbers with groupby(...).count().

Used plot.bar to check the proportion of passengers by sex.

Used countplot to compare the actual number of survivors by sex.
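
The numeric check can be sketched as follows. The six rows are invented for illustration; only the column names follow the Titanic data.

```python
import pandas as pd

# Toy data, not the real passenger list.
data = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Number of passengers in each (Sex, Survived) cell.
counts = data.groupby(["Sex", "Survived"])["Survived"].count()

# The mean of a 0/1 column is the survival rate of each group.
rates = data.groupby("Sex")["Survived"].mean()

print(counts)
print(rates)
```

Calling rates.plot.bar() on the second result would give the bar chart version of the same comparison.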

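-> Pclass, an ordinal feature, was analyzed second.

Used pd.crosstab to build a table and check the numbers.

Used plot.bar to check the overall proportions by Pclass.

Used countplot to split Survived by Pclass.

-> Sex and Pclass both looked important, so they were analyzed together.

Used pd.crosstab to build a table and check the numbers.

Used factorplot to check it visually. (It is used often because it makes it easy to separate categorical values.)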

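-> Age, a continuous feature, was analyzed third.

Used describe() to check the minimum, maximum, and mean.

Used a seaborn violinplot to look at survival by Pclass and Age.

Used a seaborn violinplot to look at survival by Sex and Age.

Missing ages were filled with a mean value. Instead of one overall mean, the Name feature was used to split passengers into groups by title (Mr, Mrs, ...), and the mean age of each group was written into the null entries with loc.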
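
A rough sketch of the title-based fill described above. The names and ages are invented, the column name Initial and the regular expression are assumptions, and the means are computed from the toy data rather than taken from the notebook.

```python
import numpy as np
import pandas as pd

# Invented rows written in the "Surname, Title. Given name" style.
data = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Smith, Mrs. Anna", "Lee, Miss. Mary",
             "Brown, Mr. Tom", "Brown, Mrs. Kate", "Gray, Mr. Sam"],
    "Age": [22.0, 38.0, 26.0, 34.0, np.nan, np.nan],
})

# Pull out the title: the first word that is followed by a period.
data["Initial"] = data["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)

# Mean age per title, computed from the rows where Age is known.
group_means = data.groupby("Initial")["Age"].mean()

# Fill each missing Age with the mean of its own title group via loc.
for title, mean_age in group_means.items():
    missing_in_group = data["Age"].isnull() & (data["Initial"] == title)
    data.loc[missing_in_group, "Age"] = mean_age

print(data)
```

The point of grouping first is that a missing "Mrs" age is filled from other married women rather than from the whole ship.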

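Used plot.hist to look at survival across ages.

-> Next, Embarked, a categorical feature.

Used pd.crosstab to look at it together with the earlier Pclass, Sex, and Survived. -> This shows the proportions.

A picture is easier to read than a table, though, so plots were drawn as well.

Used factorplot to check the survival rate by port.

Used countplot for four views: total passengers per port, passengers per port by sex, survivors per port, and finally passengers per port by Pclass.

Finally, one factorplot that covers all of the above: survival rate by Pclass, split by Sex and Embarked.

There were only 2 missing values, so fillna was used to replace them with S, the most common port.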
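
A small sketch of the fill step on invented data. Looking up the most common value with mode() is an addition here; the notes simply fill with S directly.

```python
import numpy as np
import pandas as pd

# Toy column with two missing ports.
data = pd.DataFrame({"Embarked": ["S", "C", "S", np.nan, "Q", "S", np.nan]})

most_common = data["Embarked"].mode()[0]          # most frequent port, NaN ignored
data["Embarked"] = data["Embarked"].fillna(most_common)

print(data["Embarked"].value_counts())
```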

-> SibSp, a discrete feature

Used pd.crosstab to break SibSp down by Survived.

Used barplot and factorplot to look at the survival rate for each SibSp value.

Used pd.crosstab to break SibSp down by Pclass.

-> Parch, a discrete feature

Used pd.crosstab to break Parch down by Pclass.

Used sns.barplot and factorplot to check the survival rate for each Parch value.

-> Fare, a continuous feature

Used describe to check the minimum, mean, and maximum.

Used sns.distplot to look at Fare for each Pclass. -> Useful when there are many observations.

-> Summarize the feature analysis so far in relation to the target, Survived.

sex, pclass, embarked, age, parch, sibsp

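-> Now check the correlation between all features.

sns.heatmap makes the correlations visible. When two features are perfectly (or almost perfectly) positively correlated, this is called multicollinearity. One of the two is redundant, so it is better to remove it. Doing so cuts training time; accuracy barely changes, but the problem with interpretation goes away.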
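
As a sketch of how redundant pairs can be read off the correlation matrix without a plot: the three columns are synthetic and the 0.95 threshold is an arbitrary assumption, not a value from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
data = pd.DataFrame({
    "a": a,
    "b": 2 * a + 1,               # exact linear copy of a, so correlation is 1
    "c": rng.normal(size=200),    # independent noise
})

corr = data.corr()

# Keep only the upper triangle so each pair appears once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Pairs above the (assumed) threshold are candidates for dropping one column.
redundant = pairs[pairs.abs() > 0.95]
print(redundant)
```

Passing the same corr to sns.heatmap gives the visual version of this table.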

part2: feature engineering

-> Now move on to part 2 and start feature engineering.

Unneeded features are dropped, new features are created by extracting new information, and existing features can be restructured.

-> age feature to age_band feature

Age is a continuous value, so it is hard to group by. Sex, by contrast, groups easily into male and female. Binning is used to turn continuous -> categorical: pick a bin size and divide the range by it. Here age was split into 5 bands. (A new feature is created and filled in with loc.)

Finally, value_counts() and factorplot were used to check the survival rate by age_band.
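
The binning step might look like the sketch below. The ages are invented, and the band width of 16 years (five bands up to age 80) is an assumption; the pd.cut line is an extra cross-check, not part of the notes.

```python
import pandas as pd

data = pd.DataFrame({"Age": [4, 16, 25, 40, 50, 70, 80]})

# Five bands of width 16 (assumed), encoded 0..4 and assigned with loc.
data["Age_band"] = 0
data.loc[data["Age"] <= 16, "Age_band"] = 0
data.loc[(data["Age"] > 16) & (data["Age"] <= 32), "Age_band"] = 1
data.loc[(data["Age"] > 32) & (data["Age"] <= 48), "Age_band"] = 2
data.loc[(data["Age"] > 48) & (data["Age"] <= 64), "Age_band"] = 3
data.loc[data["Age"] > 64, "Age_band"] = 4

# How many passengers fall into each band.
band_counts = data["Age_band"].value_counts().sort_index()

# Cross-check: pd.cut with the same right-closed edges yields the same codes.
cut_codes = pd.cut(data["Age"], bins=[0, 16, 32, 48, 64, 80], labels=False)

print(band_counts)
```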

-> sibsp + parch features to alone + family_size features

To look at SibSp and Parch together, a family_size feature is created.

Both new features are then plotted with factorplot, starting with the survival rate by family_size.

Finally, to see whether travelling alone is better, factorplot compares alone against not alone.
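
A sketch of the two derived features on invented rows. Defining family_size as SibSp + Parch (without counting the passenger) and alone as family_size == 0 is an assumption about the definitions.

```python
import pandas as pd

data = pd.DataFrame({
    "SibSp": [1, 0, 0, 3, 0],
    "Parch": [0, 0, 2, 1, 0],
    "Survived": [1, 0, 1, 0, 0],
})

data["Family_Size"] = data["SibSp"] + data["Parch"]   # relatives on board
data["Alone"] = 0
data.loc[data["Family_Size"] == 0, "Alone"] = 1       # no relatives on board

# Survival rate for passengers with (0) and without (1) relatives on board.
rate_by_alone = data.groupby("Alone")["Survived"].mean()
print(rate_by_alone)
```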

-> fare feature to fare band

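pd.qcut can do the binning, but by default it only labels each row with the interval it falls into, so the integer category was assigned separately with loc.

In any case, Fare is a continuous feature just like Age, so it has to be converted into a categorical feature.

Finally, use factorplot to check the survival rate by fare_cat.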
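
A short sketch of quantile binning on invented fares. Using labels=False to get integer codes directly is an alternative to the loc-based encoding described in the notes, shown here only for comparison; the choice of 4 bins is an assumption.

```python
import pandas as pd

data = pd.DataFrame({"Fare": [7.25, 7.9, 8.05, 13.0, 15.5, 26.0, 53.1, 71.3]})

# qcut picks the bin edges from quantiles, so each bin holds about the same count.
data["Fare_Range"] = pd.qcut(data["Fare"], 4)               # Interval labels
data["Fare_cat"] = pd.qcut(data["Fare"], 4, labels=False)   # integer codes 0..3

print(data)
```

The Interval column is handy for reading the cut points; the integer column is the one a model can consume.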

-> converting string to numeric

Strings such as sex, embarked, and initial are converted to numbers with replace.

-> dropping unneeded features

name, ticket, passengerid, fare_range, fare, age, and cabin are no longer needed, so they are removed.
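
The encoding and dropping steps can be sketched together. The rows and the integer codes are invented, and Series.map is swapped in for replace here because it states the full mapping explicitly; replace with the same values would serve the same purpose.

```python
import pandas as pd

# Toy frame with two string features and two columns to discard.
data = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
    "Ticket": ["t1", "t2", "t3"],
})

# Map each string category to an integer code (the codes themselves are arbitrary).
data["Sex"] = data["Sex"].map({"male": 0, "female": 1})
data["Embarked"] = data["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# Drop the columns that carry no further information for the model.
data = data.drop(columns=["Name", "Ticket"])
print(data)
```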

-> Compute the correlation on the final feature set produced by feature analysis, feature engineering, and cleaning.

Use sns.heatmap for this, and use fig = plt.gcf() to adjust the figure size.

part3: modeling

-> Using classification

1) logistic regression
2) support vector machines
3) random forest
4) knn
5) naive bayes
6) decision tree

1. matplotlib subplots()

https://m.blog.naver.com/PostView.nhn?blogId=heygun&logNo=221520454294&proxyReferer=https:%2F%2Fwww.google.com%2F

Difference between plt.subplot() and plt.subplots() #5
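
A minimal sketch of the difference, with made-up plot data. The Agg backend line is only there so the block runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# plt.subplots(): creates the figure and the whole grid of axes in one call.
fig, ax = plt.subplots(1, 2, figsize=(8, 3))
ax[0].plot([1, 2, 3])
ax[1].bar(["a", "b"], [3, 5])

# plt.subplot(): adds one axes at a time to the current figure.
fig2 = plt.figure(figsize=(8, 3))
left = plt.subplot(1, 2, 1)
left.plot([1, 2, 3])
right = plt.subplot(1, 2, 2)
right.bar(["a", "b"], [3, 5])
```

The ax array returned by subplots() is what calls such as sns.countplot(..., ax=ax[1]) take when several charts sit side by side.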

2. plot.pie

https://wikidocs.net/92114

3. sns.countplot

https://teddylee777.github.io/visualization/seaborn-tutorial-1

Seaborn statistical charts and data visualization examples

Listen again from the 22-minute mark! How to win a silver medal??

4. pandas crosstab

Builds a frequency table from a dataframe.
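
For instance, on a handful of invented rows (margins=True, which adds the "All" totals, is an optional extra):

```python
import pandas as pd

data = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# Rows are Pclass values, columns are Survived values, cells are counts.
table = pd.crosstab(data["Pclass"], data["Survived"], margins=True)
print(table)
```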

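5. seaborn factorplot

It makes the separation of categorical values easy. (factorplot was renamed to catplot in seaborn 0.9.)

6. seaborn violinplot

Shows the distribution of the data. The two ends correspond roughly to the minimum and maximum values.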

http://growthj.link/python-seaborn-%EB%8D%B0%EC%9D%B4%ED%84%B0-%EC%8B%9C%EA%B0%81%ED%99%94-%EC%B4%9D%EC%A0%95%EB%A6%AC/

[Python] seaborn data visualization roundup - GROWTH.J

** hue parameter determines which column should be used for colour encoding.

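7. correlation

Correlation is not causation, but it shows how strongly two variables are related. It is used as a stepping stone to more advanced analysis. -> Here it is used to remove redundant data. For example, if the correlation between a and b is 1, a carries the same information as b. If it is -1, they move in exactly opposite directions. Either way, one of the two is enough for training, so the duplicate is dropped to save time and other resources.

As another example, suppose that out of 100 features, 10 are useless duplicates. They add nothing but are still there, so that can't be good, right?

It matters even more for linear regression: if the target is house price and some feature has a correlation of 1 with the house price, that feature is extremely important.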

https://hyperdot.wordpress.com/2016/12/26/basic-%EC%83%81%EA%B4%80%EC%84%B1correlation/

[Basic] Correlation

8. seaborn heatmap and correlation

Calling corr() on a dataframe gives the correlations.

annot shows the value in each cell; linewidths sets the width of the lines between cells.

https://m.blog.naver.com/PostView.nhn?blogId=kiddwannabe&logNo=221205309816&proxyReferer=https:%2F%2Fwww.google.com%2F

Correlation analysis (Pandas) & drawing a heatmap

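9. multicollinearity

It means that features are perfectly (or almost perfectly) correlated with each other. When multicollinearity is present it can distort the results, so one of the variables should be dropped.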

https://m.blog.naver.com/PostView.nhn?blogId=kiddwannabe&logNo=221205309816&proxyReferer=https:%2F%2Fwww.google.com%2F

Correlation analysis (Pandas) & drawing a heatmap

So why should it be removed?

The Problem with Multicollinearity

Multicollinearity undermines the statistical significance of an independent variable. Here it is important to point out that multicollinearity does not affect the model’s predictive accuracy. The model should still do a relatively decent job predicting the target variable when multicollinearity is present. Now, I know what you are thinking. “If it does not affect the model’s ability to predict my target why should I be concerned?” While multicollinearity should not have a major impact on the model’s accuracy, it does affect the variance associated with the prediction, as well as, reducing the quality of the interpretation of the independent variables. In other words, the effect your data has on the model isn’t trustworthy. Your explanation of how the model takes the inputs to produce the output will not be reliable.

It undermines the significance of the independent variables.

It does not affect the model's accuracy.

Then if it has no effect, why should I care?

-> While multicollinearity won’t affect your prediction it will affect your interpretation of how you got there.

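--> In the end the prediction is not affected, but it becomes hard to interpret how that result was reached, so the feature should be dropped. Time efficiency also suffers.

With multicollinearity, two variables are highly correlated, so the coefficients are said to be unreliable and an accurate interpretation impossible. But what does that actually mean?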
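
One way to see what "unreliable coefficients" means is a small simulation. Everything here is synthetic and assumed for illustration: the true relation y = 3 * x1 + noise, the near-copy x2, and the helper name fit.

```python
import numpy as np

def fit(seed, collinear):
    """Least-squares fit of y on x1, plus a near-copy x2 when collinear is True."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=200)
    y = 3.0 * x1 + rng.normal(scale=0.5, size=200)
    cols = [x1]
    if collinear:
        cols.append(x1 + rng.normal(scale=0.001, size=200))  # almost identical to x1
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Refit on 20 independently drawn datasets, with and without the near-copy.
alone = np.array([fit(s, collinear=False) for s in range(20)])
both = np.array([fit(s, collinear=True) for s in range(20)])

spread_alone = alone[:, 0].std()   # x1 coefficient when x1 is used alone
spread_both = both[:, 0].std()     # x1 coefficient next to its near-copy
sum_both = both.sum(axis=1)        # x1 and x2 coefficients added together

print(spread_alone, spread_both)
print(sum_both.round(2))
```

Alone, the x1 coefficient stays close to 3 from one dataset to the next. Next to its near-copy, the individual coefficients swing widely while their sum stays near 3, so the combined effect (and hence the prediction) is stable but the weight assigned to each single feature cannot be trusted.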

https://blog.exploratory.io/why-multicollinearity-is-bad-and-how-to-detect-it-in-your-regression-models-e40d782e67e

Why Multicollinearity is Bad and How to Detect it in your Regression Models

https://towardsdatascience.com/multicollinearity-why-is-it-a-problem-398b010b77ac

Multicollinearity: Why is it a problem?

https://stats.stackexchange.com/questions/1149/is-there-an-intuitive-explanation-why-multicollinearity-is-a-problem-in-linear-r

Is there an intuitive explanation why multicollinearity is a problem in linear regression?

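To sum up: it has little effect on the predictions, but it makes an accurate interpretation impossible and training takes longer.

So dropping the redundant feature restores interpretability and saves time.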
