Linking R and Python to retrieve financial data and plot a candlestick
https://datascienceplus.com/linking-r-and-python-to-retrieve-financial-data-and-plot-a-candlestick/

I am way more experienced with R than with Python and prefer to code in this language when possible. This applies especially to visualizations: Plotly and ggplot2 are fantastic packages that provide a lot of flexibility. However, every language has its limitations, and the best results stem from combining them efficiently.

This week, I created the candlestick below, and I think it’s an excellent case study to illustrate a few things:

  • How to download financial data from investing.com using the investpy package in Python
  • How to efficiently combine the capabilities of Python and R deploying the reticulate package
  • How to construct a nicely formatted candlestick chart with ggplot2, ggthemes and two simple custom functions
  • How to export the result in different image formats, including high-resolution Scalable Vector Graphics (SVG)

    The Python part

    Let’s start with the Python code required. First, we need to install the investpy package using pip so that the simple function below can run. investpy is a fantastic and very powerful wrapper around the public API of investing.com. It allows the retrieval of end-of-day price data for a wide range of financial instruments, including stocks, bonds, ETFs, mutual funds, indices, currencies, commodities and cryptocurrencies, as well as the download of selected metadata. Detailed documentation can be found here or in PDF format under this link. Save the function defined below in a Python script.

    # pip install investpy
    
    def get_fx_cross_investpy(currency_cross, st_date, ed_date):
        # Download daily OHLC data for a currency cross from investing.com
        # (dates are passed as strings in dd/mm/yyyy format)
        import investpy
        data = investpy.get_currency_cross_historical_data(currency_cross=currency_cross,
                                                           from_date=st_date,
                                                           to_date=ed_date)
        return data
    

    The R part

    To use the previously defined Python function in R and to subsequently plot the data, we require the following four packages that can be installed easily from CRAN.

    install.packages("reticulate")
    install.packages("ggplot2")
    install.packages("ggthemes")
    install.packages("scales")
    

    Defining a pretty theme

    The ggthemes package comes with a few nice default themes for ggplot2 graphics. So you can, for instance, replicate the famous design of the Economist or the appearance of typical Stata charts. However, it is also possible to adapt these themes and create your unique default layout. I demonstrate this below for my standard formatting. The function defined here is later used in the candlestick function.

    theme_aq_black_default_font<-
      function (base_size = 12, base_family = "") 
      {
        library(ggplot2)
        library(ggthemes)
        library(scales)
        col_aq2<-as.character(c("#04103b","#dd0400","#3b5171","#5777a7","#969696","#BDBDBD","#D9D9D9","#F0F0F0"))
        
        theme_hc(base_size = base_size, base_family = base_family) %+replace% 
          # NOTE: the theme() adjustments of the original post were lost when the page
          # was extracted; the settings below are a minimal placeholder (my assumption,
          # not the author's exact layout) so that the function still runs.
          theme(text = element_text(colour = col_aq2[1]),
                plot.title = element_text(size = rel(1.2), face = "bold", hjust = 0),
                plot.subtitle = element_text(hjust = 0),
                axis.text = element_text(colour = col_aq2[1]))
      }
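
    As a quick sanity check, the custom theme can be applied to any ggplot like a built-in theme. The snippet below is a minimal sketch on made-up data and is not part of the original post:

    # Hypothetical example: apply the custom theme to a simple line chart
    library(ggplot2)
    demo <- data.frame(x = seq.Date(as.Date("2022-01-01"), by = "day", length.out = 30),
                       y = cumsum(rnorm(30)))
    ggplot(demo, aes(x, y)) +
      geom_line(colour = "#04103b") +
      theme_aq_black_default_font(base_size = 14) +
      labs(title = "Theme check", subtitle = "Dummy data")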
    

    The candlestick function

    Candlesticks are widely used in the visualization of price data and technical analysis. They allow viewers to quickly gauge the significance of market moves and analyze potential resistance levels or extraordinary price jumps that may be reverted in the future. To construct the daily candlestick displayed above, we require daily opening and closing prices as well as intraday highs and lows. Fortunately, all of this is available on investing.com and can be retrieved as a handy data frame with the function defined above.

    ggplot_candlestick<-function(df,width=0.9,chart_title,chart_subtitle)
    {
      library(ggplot2)
      # The row names of the investpy data frame hold the dates
      df$Date<-row.names(df)
      df$Date<-as.Date(df$Date,"%Y-%m-%d")
      # Flag the direction of each daily move (the comparison was garbled on the
      # original page; reconstructed as Close vs. Open)
      df$chg<-ifelse(df$Close > df$Open, "dn", "up")
      cols<-as.character(c("#04103b","#dd0400","#3b5171","#5777a7","#969696","#BDBDBD","#D9D9D9","#F0F0F0"))
      
      p<-
        ggplot(data=df,aes(x=as.Date(Date), y=High))+
        # Thin line for the intraday high-low range
        geom_linerange(aes(ymin=Low, ymax=High)) +
        # Candle body between open and close, coloured by direction
        geom_rect(aes(xmin = Date - width/2 * 0.9, xmax = Date + width/2 * 0.9, ymin = pmin(Open, Close), ymax = pmax(Open, Close), fill = chg)) + 
        scale_fill_manual(values = c("up" = "darkred", "dn" = "darkgreen"))+
        scale_colour_manual(values = cols)+
        theme_aq_black_default_font(base_size=18)+
        labs(color='')+
        labs(title=chart_title,subtitle=chart_subtitle,x ="")+
        labs(caption = paste0('Source: DataScience+, Investing.com  ', Sys.Date()))+
        guides(colour = guide_legend(nrow = 1))+
        scale_x_date(labels = date_format("%y/%m"))+
        theme(legend.position = "none",legend.margin=margin(-20,-20,-20,-20),legend.box.margin=margin(0,0,30,0))+
        ylab("")+
        theme(plot.margin=margin(l=5,r=20,b=5,t=5))
    
      return(p)
    }
    
    

    Plot the data and export the graphic

    Last but not least, let’s combine all these modules and execute them step by step. Once we have loaded our Python function employing the reticulate package, we can use it in R to retrieve the financial data from investpy. We can subsequently use our previously defined R functions to create the candlestick plot. The plot can then be exported easily as a PNG or SVG graphic utilizing ggsave.

    # Load the python function and retrieve the financial data
    library(reticulate)
    source_python("C:/Users/Fabian/Desktop/get_rates_investing.com.py")
    df<-get_fx_cross_investpy("USD/RUB",'01/01/2022','01/05/2022')   
    
    # Use the R functions and plot the data
    p<-ggplot_candlestick(df,chart_title="Following its crash, the Russian Ruble rebounded sharply",chart_subtitle="USD/RUB exchange rate")
    p
    
    # Save the plot
    target_folder<-"C:/Users/Fabian/Desktop/"
    ggsave(file=paste0(target_folder,"candlestick_usd_rub.svg"), plot=p, width=9, height=5)
    ggsave(file=paste0(target_folder,"candlestick_usd_rub.png"), plot=p, width=9, height=5)
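
    A note on the SVG export: as far as I can tell, ggsave() relies on the svglite package for .svg output, so if the first call above errors the package may need to be installed first (this remark is mine, not from the original post).

    # install.packages("svglite")  # typically required for ggsave(..., "*.svg")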
    

    An R alternative to pairs for -omics QC
    https://datascienceplus.com/an-r-alternative-to-pairs-for-omics-qc/

    Introduction

    The Problem: I've got a couple of problems with the commonly used “pairs” plot in R for quality control of -omics data. (1) It's not very space-efficient, since only half of the panel area is used for data points, and (2) the scatterplots aren't very informative. When I look at those scatter plots it's hard to tell anything about the spread of the data or any normalization problems. This is particularly true for proteomics data, where the high dynamic range can obscure lower-abundance points.

    The Solution: A panel of MA plots (Minus-Average). An MA plot shows fold-change versus average intensity for a pair of samples. It lets you see differences between sample groups as fold-changes, which I think is a useful unit for comparison, and it makes normalization problems visible. Rather than plotting each sample against every other, we will only compare samples between groups to save space.
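
    For reference, the classic MA transformation works on a log scale (M is the log2 fold-change, A the mean log2 intensity), whereas the code later in this post computes the raw ratio and mean and log-scales the axes instead. A minimal sketch with made-up count vectors:

    # Classic MA transformation for two hypothetical samples x and y
    ma_transform <- function(x, y) {
      data.frame(
        A = (log2(x) + log2(y)) / 2,  # average log2 intensity
        M = log2(y) - log2(x)         # log2 fold-change, y vs x
      )
    }
    ma_transform(x = c(103, 82, 89), y = c(71, 79, 76))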

    This goes along with a previous post of mine that attempts to convince biologists of the value of putting tables of data into tidy format. This method takes advantage of pivoting data to succinctly generate a panel of MA plots.

    suppressPackageStartupMessages(library(tidyverse))
    library(GGally)
    library(DEFormats)
    

    Set up the data

    I'll start with simulated data that resembles a gene expression study. A proteomics dataset would be similar. The dataset will have 8 samples, half of them treated, half control. Seven of the samples will be approximately the same, but sample 4 will have a 3-fold increase compared to the rest to illustrate how MA plots help identify problems with normalization.

    counts <- simulateRnaSeqData(n = 5000, m = 8)
    counts[, 4] <- counts[, 4] *  3
    
    targets <- data.frame(sample = colnames(counts), group = c(rep("control", 4), rep("treated", 4)))
    

    The ggpairs function from the GGally package does a decent job of the pairs plot.

    ggpairs(data.frame(counts))
    

    The pairs plot tells us something about the data. The correlation is nice to have and if any sample was wildly different from the others it would show up in the scatter plot. Still I don't think it conveys the information very efficiently.

    MA plot panel

    Typically I would start by pivoting all the count data into a single column and joining in the metadata. But I need to associate control and treated data for each gene for each sample, so the usual method won't work. It took me a while to hit on the solution: we have to pivot the control and treated samples separately. So for each gene we will have a control sample name, a treated sample name, and the control and treated count data. Those can be used to calculate fold-change and intensity.

    control_samples <- targets$sample[targets$group == "control"]
    treated_samples <- targets$sample[targets$group == "treated"]
    
    data.frame(counts) %>%
      rownames_to_column("gene") %>%
      pivot_longer(all_of(control_samples), names_to = "control_sample", values_to = "control_count") %>%
      pivot_longer(all_of(treated_samples), names_to = "treated_sample", values_to = "treated_count") %>%
      mutate(FC = treated_count / control_count) %>%
      mutate(Intensity = (treated_count + control_count) / 2)
    ## # A tibble: 80,000 x 7 
    ##    gene  control_sample control_count treated_sample treated_count    FC Intensity
    ##    <chr> <chr>                  <dbl> <chr>                  <dbl> <dbl>     <dbl>
    ##  1 gene1 sample1                  103 sample5                   71 0.689      87  
    ##  2 gene1 sample1                  103 sample6                   79 0.767      91  
    ##  3 gene1 sample1                  103 sample7                   76 0.738      89.5
    ##  4 gene1 sample1                  103 sample8                  118 1.15      110. 
    ##  5 gene1 sample2                   82 sample5                   71 0.866      76.5
    ##  6 gene1 sample2                   82 sample6                   79 0.963      80.5
    ##  7 gene1 sample2                   82 sample7                   76 0.927      79  
    ##  8 gene1 sample2                   82 sample8                  118 1.44      100  
    ##  9 gene1 sample3                   89 sample5                   71 0.798      80  
    ## 10 gene1 sample3                   89 sample6                   79 0.888      84  
    ## # ... with 79,990 more rows
    

    All that's left to do now is plot. facet_grid() will let us split the samples up.

    data.frame(counts) %>%
      rownames_to_column("gene") %>%
      pivot_longer(all_of(control_samples), names_to = "control_sample", values_to = "control_count") %>%
      pivot_longer(all_of(treated_samples), names_to = "treated_sample", values_to = "treated_count") %>%
      mutate(FC = treated_count / control_count) %>%
      mutate(Intensity = (treated_count + control_count) / 2) %>%
      ggplot(aes(x = Intensity, y = FC)) +
      geom_point(alpha = 0.5, na.rm = TRUE) +
      scale_x_continuous(trans = "log10") +
      scale_y_continuous(trans = "log2", breaks = 2^seq(-4, 4, 2)) +
      geom_hline(yintercept = 1) +
      labs(x = "Intensity", y = "Fold Change, treated vs control") +
      facet_grid(rows = vars(treated_sample), cols = vars(control_sample))
    

    The change in abundance in sample 4 shows up much more clearly now. This isn't a common way to plot the data so it might require some explaining to your colleagues but worth the effort in my opinion.

    Visualizing economic data with pretty worldmaps
    https://datascienceplus.com/visualizing-economic-data-with-pretty-worldmaps/

    Choropleths are a nice tool for visualizing geographic data, and with R and Python their creation can be pretty effortless, especially when clean data is readily available. Fortunately, a lot of economic datasets can be loaded from the World Bank Open Data platform. Nevertheless, making the visualization look nice can be a bit cumbersome. For this reason, I have created a small, integrated function that handles the data download, adjustment, and visualization all in one. The output can easily be exported as a high-quality vector graphic using ggsave.

    To replicate the example, simply follow these three steps.

    1) Install all required packages

    install.packages("rnaturalearth")
    install.packages("rnaturalearthdata")  
    install.packages("WDI")
    install.packages("lubridate")
    install.packages("dplyr")  
    install.packages("ggplot2")
    install.packages("RColorBrewer")
    

    2) Load the function

    #Download and visualization function
    get_and_plot_wb_data<-function(sel_indicator="NY.GDP.PCAP.KD",
                                   log="Yes",
                                   midpoint="Mean",
                                   color_scheme="Straight",
                                   start_yr = year(Sys.Date())-2,
                                   end_yr = year(Sys.Date()),
                                   legend_pos="Bottom",
                                   charttitle="GDP per Capita in US$",
                                   scale_factor=1)
    {
    #Load required packages
    library(rnaturalearth)
    library(rnaturalearthdata)  
    library(WDI)
    library(lubridate)
    library(dplyr)  
    library(ggplot2)
    library(RColorBrewer)
    
    #Create color scheme
    cols_gr<-c("#632523", "#953735",  "#C3D69B","#77933C", "#4F6228")
    cols_gr = colorRampPalette(cols_gr)(5)
    
    #Get all countries and countrycodes
    world <- ne_countries(scale = "medium", returnclass = "sf")
    class(world)
    
    #Load data using the worldbank API
    gdp<-WDI(
      country = "all",
      indicator = sel_indicator,
      start = start_yr,
      end = end_yr,
      extra = FALSE,
      cache = NULL,
      latest = NULL,
      language = "en"
    )
    
    names(gdp)[1]<-"iso_a2"
    names(gdp)[names(gdp)==sel_indicator]<-"ret"
    gdp<-gdp[order(gdp$country,gdp$year),]
    print(max(gdp$year))
    gdp <- gdp %>% group_by(iso_a2) %>% do(tail(., n=1))  # keep only the latest available year per country
      
      
    world<-merge(world,gdp,by="iso_a2",all=F)
    
    #Remove some unnecessary elements
    plain <- theme(
      panel.background = element_rect(fill = "white"),
    )
    
    world$ret<-round(world$ret/scale_factor,2)
    #Choose appropriate breakpoint
    if(midpoint=="Median")
    {
      midpoint<-median((world$ret),na.rm=T)
    }else{
      if(midpoint=="Mean")
      {
        midpoint<-mean((world$ret),na.rm=T)  
      }else{
        midpoint<-midpoint
      }  
    }
    
    #Red to green or green to red
    if(color_scheme=="Inverted")
    {
      col_low<-"#276419"
      col_high<-"#8E0152"
    }else{
      col_low<-"#8E0152"
      col_high<-"#276419"
    }
    
    #Plot map with ggplot
    
    p9<-
      ggplot(data = world) +
      geom_sf(aes(fill = (ret))) +
      ggtitle(charttitle) +
      #theme(legend.position = legend_pos,legend.margin=margin(-10,-10,-10,-10),legend.box.margin=margin(-100,800,-50,-50))+
      plain 
      
    #Use log scale for skewed data
    if(log=="Yes")
    {  
      p9<-p9+  
        scale_fill_gradient2(
          trans = "log",
          low = col_low,
          mid = "white",
          high = col_high,
          midpoint = log(midpoint),
          space = "Lab",
          na.value = "grey50",
          guide = "colourbar",
          aesthetics = "fill",
          name=""
        )
    }else{
      p9<-p9+  
        scale_fill_gradient2(
          low = col_low,
          mid = "white",
          high = col_high,
          midpoint = midpoint,
          space = "Lab",
          na.value = "grey50",
          guide = "colourbar",
          aesthetics = "fill",
          name=""
        )
    }
    
    res_list<-list("dataset"=world,"visualization"=p9)
    
    return(res_list)
    
    }
    
    

    3) Run the function and save the plot

    #Run the function
    res_list<-get_and_plot_wb_data(sel_indicator="NY.GDP.PCAP.KD",midpoint="Median",log="Yes",legend_pos="bottom",charttitle="GDP per Capita in US$",scale_factor=1)
    res_list$visualization
    
    #Export the result
    target_folder<-"C:/your_path/"
    ggsave(file=paste0(target_folder,"gdp_per_capita.svg"), plot=res_list$visualization, width=14, height=7)
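
    The same helper can be pointed at any other World Bank indicator code. A small sketch (the indicator code SP.POP.TOTL for total population and the scale factor are my own example values, not from the original post):

    # Hypothetical example: total population in millions instead of GDP per capita
    res_pop <- get_and_plot_wb_data(sel_indicator="SP.POP.TOTL",
                                    midpoint="Median",
                                    log="Yes",
                                    charttitle="Total population (millions)",
                                    scale_factor=1e6)
    res_pop$visualization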
    
    

    How I selected my starting word for Wordle using simulations and R
    https://datascienceplus.com/how-i-selected-my-starting-word-for-wordle-using-simulations-and-r/

    Wordle has been around for some time now and I think it’s not quite necessary to explain what it is and how to play it. (If I lost you there, please read more about it (and play) here.) I’m not the best at word games, so I don’t find them very entertaining. But thinking of algorithmic ways to find puzzle solutions faster using R is! That’s why I started to think of random ideas regarding Wordle: for people who have played a lot, what does their guessing-word distribution look like? Are there better or worse words to start with? Are there letters that are significantly more relevant and would be useful for your first guess?

    I’ve seen some people answer similar questions after I started thinking on the matter, especially on most frequent letters by position (post), and a way to play and replicate the game using R (post).

    Keep in mind the “winner starting word” you find depends on:
    1. the words you pick to evaluate as possible best words,
    2. the words you are trying to predict and test toward,
    3. the valid words randomly picked to guess, based on the set of seeds picked to select those words

    Yet, it gives us a winner solution.

    1. Install and load lares

    Run install.packages("lares") and then, load it with library("lares"). No more packages needed from now on.

    2. Select your starting point

    Let’s pick some “good words” to see which ones are the best. I won’t get into details but I set a couple of rules based on letter by position frequency to set these initial words. There are 48 words that comply with these filters. I’m using some old Scrabble functions I previously developed for the library for this purpose.

    some_words <- scrabble_words(
      language = "en", # for English words
      tiles = "ESAORILTNU", # 67% cumulative letters
      force_n = 5, # comply Wordle universe of 5 letter words
      force_start = "s", # 16% words start with S
      force_str = "e")$word # most frequent letter
    sort(toupper(some_words))
     [1] "SAINE" "SALET" "SALUE" "SANER" "SAUTE" "SENOR" "SENTI" ...
    
    

    We could sample some of them, manually pick those we like most, or simply use all if you’re patient enough to run the simulations:
    best_words <- toupper(some_words)

    Now, I picked 10 random words we are going to guess, starting from the “best words” we picked. The more we use, the better it’ll represent the universe of words (which is about 13K 5 letter words).

    set.seed(2) # For reproducibility (same results)
    test_words <- toupper(sample(wordle_dictionary("en"), 10)); test_words # Words to guess randomly picked
      [1] "BOZOS" "RESET" "TROWS" "SALTS" "JIBES" "YIRKS" "DURST" "SITES" "PENGO" "GARRE"
    
    

    Finally, how many different random picking scenarios we’d like to test on each combination of best words and test words:
    seeds <- 20 # Number of different random picks to check distribution

    So basically, expect 480 iterations in total if you use these same settings. For me, it took about 30 minutes on my personal computer.
    print(length(best_words) * length(test_words)) # 48 best words x 10 test words

    3. Run simulations

    Now that we already set our starting point, let’s run the simulations.

    results <- temp <- NULL
    tic("wordle_loop")
    for (word in best_words) {
      cat(sprintf("- Seed word: %s [%s/%s]\n", word, which(word == best_words), length(best_words)))
      for (objective in test_words) {
        cat(sprintf("Guess %s [%s/%s] <= %s simulations: ", objective, which(objective == test_words), length(test_words), seeds))
        temp <- wordle_simulation(word, objective, seed = 1:seeds, print = FALSE, quiet = TRUE)
        results[[word]][[objective]] <- temp
        cat(signif(mean(sapply(temp, function(x) x$iters)), 4), "\n")
      }
    }
    toc("wordle_loop")
    - Seed word: SALUE [1/48]
    Guess BOZOS [1/10] <= 20 simulations: 6.95 
    Guess RESET [2/10] <= 20 simulations: 5.5 
    Guess TROWS [3/10] <= 20 simulations: 6.4 
    Guess SALTS [4/10] <= 20 simulations: 4.7 
    Guess JIBES [5/10] <= 20 simulations: 7.05 
    Guess YIRKS [6/10] <= 20 simulations: 6.55 
    Guess DURST [7/10] <= 20 simulations: 5.05 
    Guess SITES [8/10] <= 20 simulations: 6.95 
    Guess PENGO [9/10] <= 20 simulations: 5.3 
    Guess GARRE [10/10] <= 20 simulations: 6.05
    
    - Seed word: SILEN [2/48]
    Guess BOZOS [1/10] <= 20 simulations: 6.45 
    ...
    
    

    4. Gather results and check winners

    Let’s check the sorted results to see the best and worst words.

    best_means <- lapply(results, function(a) sapply(a,function(x) mean(sapply(x, function(x) x$iters))))
    sort(sapply(best_means, mean))
    STIRE SOARE SNARE STARE STORE STALE STEAL SNORE
    5.205 5.240 5.325 5.325 5.365 5.380 5.385 5.400 ...
    
    
    hist(sapply(best_means, mean), breaks = 30)
    

    There’s a small range in these convergence means: (5.205, 5.890). That means that all these words are similarly good or bad compared with each other. But notice that we can still easily split the best from the worst words with this methodology.
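
    That range can be read straight off the means computed above (a small sketch reusing the best_means object from the previous chunk):

    range(sapply(best_means, mean))       # smallest and largest convergence mean
    diff(range(sapply(best_means, mean))) # spread between the best and worst word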

    To understand this point a bit more, let’s study a random benchmark: I picked 25 random words (not selected with the “best” words criteria) as my new best_words <- toupper(sample(some_words, 25)). Then I re-ran all the code with the same parameters and test words, for a total of 250 iterations, and got the following distribution. (Note: it took 18 minutes this time.)

    And the theory was confirmed. We did pick our first best words correctly, given that the results for random words are clearly worse. Now the range covers a convergence mean of (5.700, 6.585). Notice that the best random words are not quite within the best-words range except for a few lucky cases. And the best of the best words converge ~1.4 guesses before the worst of the random words. So we can actually do something about it and use better words to start our game!

    Final comments, asks, and considerations

    – There are other wordle_* functions in the package you can use as a dictionary, to actually play, and to run simulations. Check out the nice colored results it prints. In the examples there is a small demo of how you can play with a random unknown word without limits.

    – You probably won’t see a great difference from using one of the best picked words unless you play thousands of times: it’s more a matter of large numbers (and being smart about picking the right available words when guessing).
    – You could try to re-run this analysis with a wider range of best words or test words to see if they can be improved. Note that the best words are the ones that converge sooner, thus lower iteration means.
    – Expect to have these wordle_* functions updated in CRAN in a month or so (with nicer plots as well).
    – Now, from the second input word onwards, it’s up to your picking skills. Feel free to check which words are left available using the scrabble_words() function.

    Bonus benchmarks with different “best words” criteria

    I ran one more experiment, using parallel calculations on 24 cores (which returned results in ~10min for 96 words and 40 iterations scenario), and got to the same conclusions as before, regardless of some outliers. FYI: “ESAORILT” are the most frequent letters, in order.
    1) (all) 24 words containing E, starting with S, using any of “ESAORILT” letters
    2) 96 good words containing E, using any of “ESAORILT” letters
    3) 96 random words


    If you are wondering what those lower words are (remember the randomness factor), you’d get SOARE, ALERT, and RAILE. And the worst ones? LOGOI, KEVEL, and YAFFS.

    Forecast using Arima Model in R
    https://datascienceplus.com/forecast-using-arima-model-in-r/


    ARIMA Modeling

    AutoRegressive Integrated Moving Average

    Install Packages

    library(readxl)
    library(lmtest)
    library(forecast)
    library(FitAR)
    library(fUnitRoots)
    

    Import Data Set

    table2 <- read_excel("Datum/1 SCOPUS/2022/Feb-01/Data/table2.xlsx",sheet = "Sheet2")
    View(table2) 
    
    summary(table2)
        avg_jual        avg_beli    
     Min.   : 8808   Min.   : 8766  
     1st Qu.:13498   1st Qu.:13480  
     Median :14190   Median :14078  
     Mean   :13979   Mean   :13800  
     3rd Qu.:14705   3rd Qu.:14257  
     Max.   :15753   Max.   :15020 
    
    
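    # Build monthly time series (starting January 2017) for the average selling (jual) and buying (beli) rates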
    tsJual = ts(table2$avg_jual, start = c(2017,1), frequency = 12)
    plot(tsJual)
    tsBeli = ts(table2$avg_beli, start = c(2017,1), frequency = 12)
    plot(tsBeli)
    
    components.tsJual = decompose(tsJual)
    plot(components.tsJual)
    components.tsBeli = decompose(tsBeli)
    plot(components.tsBeli)
    
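    # KPSS unit-root tests (fUnitRoots) to check whether each series is stationary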
    urkpssTest(tsJual, type = c("tau"), lags = c("short"),use.lag = NULL, doplot = TRUE)
    urkpssTest(tsBeli, type = c("tau"), lags = c("short"),use.lag = NULL, doplot = TRUE)
    
    tsstationary_Jual = diff(tsJual, differences=1)
    plot(tsstationary_Jual)
    tsstationary_Beli = diff(tsBeli, differences=1)
    plot(tsstationary_Beli)
    acf(tsJual,lag.max=34)
    acf(tsBeli,lag.max=34)
    Pacf(tsJual,lag.max=34)
    Pacf(tsBeli,lag.max=34)
    
    timeseriesseasonallyadjusted_Jual <- tsJual- components.tsJual$seasonal
    tsstationary_Jual <- diff(timeseriesseasonallyadjusted_Jual, differences=1)
    timeseriesseasonallyadjusted_Beli <- tsJual- components.tsBeli$seasonal
    tsstationary_Beli <- diff(timeseriesseasonallyadjusted_Beli, differences=1)
    
    plot(timeseriesseasonallyadjusted_Beli)
    plot(timeseriesseasonallyadjusted_Jual)
    
    acf(tsstationary_Jual, lag.max=34)
    pacf(tsstationary_Jual, lag.max=34)
    acf(tsstationary_Beli, lag.max=34)
    pacf(tsstationary_Beli, lag.max=34)
    
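    # Fit seasonal ARIMA(1,1,1)(1,0,0)[12] models by maximum likelihood; coeftest() then checks coefficient significance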
    fitARIMA_Jual <- arima(tsJual, order=c(1,1,1),seasonal = list(order = c(1,0,0), period = 12),method="ML")
    fitARIMA_Beli <- arima(tsBeli, order=c(1,1,1),seasonal = list(order = c(1,0,0), period = 12),method="ML")
    
    coeftest(fitARIMA_Jual) 
    z test of coefficients:
    
          Estimate Std. Error z value Pr(>|z|)
    ar1  -0.021344   1.837953 -0.0116   0.9907
    ma1   0.083561   1.842706  0.0453   0.9638
    sar1  0.072859   0.274394  0.2655   0.7906
    
    
    
    coeftest(fitARIMA_Beli) 
    z test of coefficients:
    
           Estimate Std. Error z value Pr(>|z|)
    ar1   0.0032167  0.6907733  0.0047   0.9963
    ma1   0.0509199  0.7058832  0.0721   0.9425
    sar1 -0.0026367  0.3522116 -0.0075   0.9940
    
    
    fitARIMA_Jual
    Call:
    arima(x = tsJual, order = c(1, 1, 1), seasonal = list(order = c(1, 0, 0), period = 12), 
        method = "ML")
    
    Coefficients:
              ar1     ma1    sar1
          -0.0213  0.0836  0.0729
    s.e.   1.8380  1.8427  0.2744
    
    sigma^2 estimated as 472215:  log likelihood = -373.76,  aic = 755.51
    
    
    fitARIMA_Beli
    Call:
    arima(x = tsBeli, order = c(1, 1, 1), seasonal = list(order = c(1, 0, 0), period = 12), 
        method = "ML")
    
    Coefficients:
             ar1     ma1     sar1
          0.0032  0.0509  -0.0026
    s.e.  0.6908  0.7059   0.3522
    
    sigma^2 estimated as 457012:  log likelihood = -372.95,  aic = 753.91
    
    
    confint(fitARIMA_Beli)
     2.5 %    97.5 %
    ar1  -1.3506740 1.3571074
    ma1  -1.3325858 1.4344256
    sar1 -0.6929589 0.6876854
    
    
    confint(fitARIMA_Jual)
    2.5 %    97.5 %
    ar1  -3.6236644 3.5809772
    ma1  -3.5280769 3.6951992
    sar1 -0.4649435 0.6106622
    
    
    acf(fitARIMA_Beli$residuals)
    acf(fitARIMA_Jual$residuals)
    
    boxplot(fitARIMA_Jual$residuals,k=2,StartLag=1)
    LjungBoxTest(fitARIMA_Jual$residuals,k=2,StartLag=1)
    boxplot(fitARIMA_Beli$residuals,k=2,StartLag=1)
    LjungBoxTest(fitARIMA_Beli$residuals,k=2,StartLag=1)
    
    qqnorm(fitARIMA_Jual$residuals)
    qqline(fitARIMA_Jual$residuals)
    qqnorm(fitARIMA_Beli$residuals)
    qqline(fitARIMA_Beli$residuals)
    
    arima(tsJual)
    Call:
    arima(x = tsJual)
    
    Coefficients:
           intercept
          13978.5625
    s.e.    177.7277
    
    sigma^2 estimated as 1516180:  log likelihood = -409.67,  aic = 823.34
    
    
    arima(tsBeli)
    Call:
    arima(x = tsBeli)
    
    Coefficients:
           intercept
          13800.3958
    s.e.    165.1939
    
    sigma^2 estimated as 1309870:  log likelihood = -406.16,  aic = 816.32
    
    
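    # Automatic order selection with auto.arima(); trace=TRUE prints every candidate model and its information criterion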
    auto.arima(tsJual, trace=TRUE)
    ARIMA(2,1,2)(1,0,1)[12] with drift         : Inf
     ARIMA(0,1,0)            with drift         : 750.4713
     ARIMA(1,1,0)(1,0,0)[12] with drift         : 755.0944
     ARIMA(0,1,1)(0,0,1)[12] with drift         : 755.0899
     ARIMA(0,1,0)                               : 749.8579
     ARIMA(0,1,0)(1,0,0)[12] with drift         : 752.7432
     ARIMA(0,1,0)(0,0,1)[12] with drift         : 752.7402
     ARIMA(0,1,0)(1,0,1)[12] with drift         : Inf
     ARIMA(1,1,0)            with drift         : 752.7152
     ARIMA(0,1,1)            with drift         : 752.7136
     ARIMA(1,1,1) with drift         : Inf
    
     Best model: ARIMA(0,1,0)                               
    
    Series: tsJual 
    ARIMA(0,1,0) 
    
    sigma^2 = 475492:  log likelihood = -373.88
    AIC=749.77   AICc=749.86   BIC=751.62
    
    
    auto.arima(tsBeli, trace=TRUE)
    ARIMA(2,1,2)(1,0,1)[12] with drift         : Inf
     ARIMA(0,1,0)            with drift         : 748.9614
     ARIMA(1,1,0)(1,0,0)[12] with drift         : 753.5961
     ARIMA(0,1,1)(0,0,1)[12] with drift         : 753.5933
     ARIMA(0,1,0)                               : 748.1365
     ARIMA(0,1,0)(1,0,0)[12] with drift         : 751.2314
     ARIMA(0,1,0)(0,0,1)[12] with drift         : 751.2295
     ARIMA(0,1,0)(1,0,1)[12] with drift         : 753.6222
     ARIMA(1,1,0)            with drift         : 751.2183
     ARIMA(0,1,1)            with drift         : 751.2173
     ARIMA(1,1,1) with drift         : Inf
    
     Best model: ARIMA(0,1,0)                               
    
    Series: tsBeli 
    ARIMA(0,1,0) 
    
    sigma^2 = 458392:  log likelihood = -373.02
    AIC=748.05   AICc=748.14   BIC=749.9
    
    
    fitARIMA_Jual
    Call:
    arima(x = tsJual, order = c(1, 1, 1), seasonal = list(order = c(1, 0, 0), period = 12), 
        method = "ML")
    
    Coefficients:
              ar1     ma1    sar1
          -0.0213  0.0836  0.0729
    s.e.   1.8380  1.8427  0.2744
    
    sigma^2 estimated as 472215:  log likelihood = -373.76,  aic = 755.51
    
    
    fitARIMA_Beli
    Call:
    arima(x = tsBeli, order = c(1, 1, 1), seasonal = list(order = c(1, 0, 0), period = 12), 
        method = "ML")
    
    Coefficients:
             ar1     ma1     sar1
          0.0032  0.0509  -0.0026
    s.e.  0.6908  0.7059   0.3522
    
    sigma^2 estimated as 457012:  log likelihood = -372.95,  aic = 753.91
    
    
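    # One-step-ahead point forecasts (and standard errors) for January 2021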
    predict(fitARIMA_Jual,n.ahead = 1)
    $pred
             Jan
    2021 14665.9
    
    $se
              Jan
    2021 687.1792
    
    
    predict(fitARIMA_Beli,n.ahead = 1)
    $pred
              Jan
    2021 14122.35
    
    $se
              Jan
    2021 676.0263
    
    
    futurVal_Beli <- forecast(fitARIMA_Beli,h=1, level=c(99.5))
    futurVal_Jual <- forecast(fitARIMA_Jual,h=1, level=c(99.5))
    
    plot(futurVal_Beli)
    plot(futurVal_Jual)
    
    summary(futurVal_Jual)
    Forecast method: ARIMA(1,1,1)(1,0,0)[12]
    
    Model Information:
    
    Call:
    arima(x = tsJual, order = c(1, 1, 1), seasonal = list(order = c(1, 0, 0), period = 12), 
        method = "ML")
    
    Coefficients:
              ar1     ma1    sar1
          -0.0213  0.0836  0.0729
    s.e.   1.8380  1.8427  0.2744
    
    sigma^2 estimated as 472215:  log likelihood = -373.76,  aic = 755.51
    
    Error measures:
                       ME     RMSE      MAE      MPE     MAPE      MASE        ACF1
    Training set 107.4817 679.9846 237.3794 0.794616 1.695755 0.2583878 -0.02594214
    
    Forecasts:
             Point Forecast  Lo 99.5  Hi 99.5
    Jan 2021        14665.9 12736.97 16594.84
    
    
    summary(futurVal_Beli)
    Forecast method: ARIMA(1,1,1)(1,0,0)[12]
    
    Model Information:
    
    Call:
    arima(x = tsBeli, order = c(1, 1, 1), seasonal = list(order = c(1, 0, 0), period = 12), 
        method = "ML")
    
    Coefficients:
             ar1     ma1     sar1
          0.0032  0.0509  -0.0026
    s.e.  0.6908  0.7059   0.3522
    
    sigma^2 estimated as 457012:  log likelihood = -372.95,  aic = 753.91
    
    Error measures:
                      ME     RMSE      MAE       MPE     MAPE      MASE        ACF1
    Training set 106.293 668.9485 220.1657 0.7954476 1.599417 0.2807839 -0.02688605
    
    Forecasts:
             Point Forecast  Lo 99.5  Hi 99.5
    Jan 2021       14122.35 12224.72 16019.98
    
    

    Propagating nerve impulse in Hodgkin-Huxley model. Modeling with R. Part 2
    https://datascienceplus.com/propagating-nerve-impulse-in-hodgkin-huxley-model-modeling-with-r-part-2/

    Introduction

    In this second part we will present a numerical method for solving the PDE system that describes the propagation of the action potential. We will make use of the R packages deSolve and ReacTran to simulate the model. The underlying Hodgkin-Huxley model used for our simulation is actually based on the telegraph equations. In contrast to the standard models, where the inductance is neglected, here we also use the Hodgkin-Huxley model but without neglecting the self-inductance of the axon (see “Isn’t there an inductance factor in the plasma membrane of nerves?”). This model is based on the \((RLC)\) (Resistance-Inductance-Capacitance) electric circuit analogue, in which ionic currents through the cylindrical membrane are also taken into account.

    Propagation Action Potential

    The mathematical equation describing the propagation in space and time of the action potential \(V_m\) along a neural axon is given by :
    $$ \begin{align}\frac{\partial^2 V_m}{\partial x^2} - LC_a\frac{\partial^2 V_m}{\partial t^2} = \frac{2}{a}RC_a\frac{\partial V_m}{\partial t} + \frac{2}{a}L\frac{\partial I_{ion}}{\partial t} + \frac{2}{a}RI_{ion} \hspace{30pt} (C.6)\end{align} $$
    Where:
    \(V_m\) is the potential difference across the membrane (dependent variable, depends on \(x\) and \(t\)).
    \(x \) is independent variable representing one dimension of three-dimensional space.
    \(t\) is independent variable representing time.
    \(L\) is the axon specific self-inductance.
    \(R\) is the specific resistance of an axon.
    \(C_a\) is the axon self-capacitance per unit area per unit length.
    \(I_{ion}\) is the sum of ions currents.
    \(a\) is the axon radius.
    The derivation of equation \((C.6)\) for an axon represented by the \(RLC\) (Resistance-Inductance-Capacitance) circuit is performed in Appendix A (in case you are interested in the derivation of equation \((C.6)\), just send me a mail at kritiker2017@gmail.com). Note that if the presence of induction in the system is neglected \((L=0)\), equation \((C.6)\) becomes the non-linear cable equation, which is not resistant to analytical approaches.
    In the Hodgkin-Huxley model the ion current \(I_{ion}\) is defined as the sum of ions currents of potassium and sodium ( \(I_K\) and \(I_{Na}\)), and a smaller current (\(I_L\) ) made up of chloride and other ions:
    $$I_{ion} = I_K + I_{Na} + I_L = g_K(V_m - V_K) + g_{Na}(V_m - V_{Na}) + g_L(V_m - V_L) \hspace{30pt} (C.7)$$ where \(g_K\), \(g_{Na}\) and \(g_L\) are the potassium, sodium and leakage conductances, respectively.
    We define:
    the diffusion coefficient as \(D^2 = \frac{a}{2LC_m}\)
    the relaxation time as \(\tau = \frac{L}{R}\)
    the parameter \(\mu = \frac{\tau}{C_m}\) characterizes the inductance in the system.

    Substituting the equation \((C.7)\) into equation \((C.6)\) we obtain the final nerve propagation equation in the Hodgkin & Huxley model, which will be used for our simulation.
    $$
    \begin{align}\tau\frac{\partial^2 V_m}{\partial t^2} = \tau D^2\frac{\partial^2 V_m}{\partial x^2} - \big[1+\mu(g_{K}n^4 + g_{Na}m^3h+g_{L})\big]\frac{\partial V_m}{\partial t} - \\ g_{K}\Big(\frac{\mu}{\tau}n^4 + 4\mu n^3\frac{\partial n}{\partial t}\Big)(V_m - V_K)
    - \\ g_{Na}\big[\frac{\mu}{\tau}m^3h + \mu\big(3m^2h\frac{\partial m}{\partial t}+ m^3\frac{\partial h}{\partial t}\big) \big](V_m - V_{Na}) - \\ g_L\frac{\mu}{\tau}(V_m - V_L) \hspace{40pt} (C.8)\end{align} $$

    Numeric solution

    To solve equation \((C.8)\) we will make use of the R packages {deSolve} and {ReacTran}. The package deSolve is an add-on package for the open-source data analysis system R for the numerical treatment of systems of differential equations.
    In this blog post we solve equation \((C.8)\) on a one-dimensional domain \(\Omega = [0,15]\), with the initial conditions \(V_{m}(x, 0) = -15 \exp\left(-\frac{x^2}{D^2}\right)\) and \(\frac{\partial V_{m}(x,0)}{\partial t} = 0\), and boundary conditions of Dirichlet type \(V_{m}(x = 0, t) = 0\), \(V_{m}(x = 15, t) = 0\). The “method of lines”, where space (\(\Omega\)) is discretized in fixed steps while time is left in continuous form, will be used.

    library(ReacTran)
      # Create one-dimensional finite difference grid
      dx <- 1
      xgrid <- setup.grid.1D(x.up = 0, x.down = 10, dx.1 = dx) 
      x <- xgrid$x.mid
      N <- xgrid$N
      # Model Parameters
      ## Passive parameters of the neuron
      a <- 238*10^(-4) # axon radius (cm)
      R <- 35.4        # Specific resistance of the axon (Ohm cm)
      L <- 15          # Axon specific self-inductance
      C_m <- 0.001     # Membrane capacitance density Cm 1.0 micro F/cm2
      D_coefficient <- sqrt(a/(2*L*C_m))
      tau <- L/R
      mu <- tau/C_m
      Iinj <- 0       # injected current 
      # Values of the neuron 
      g_K  <- 0.036       # conductance density g_K
      g_Na <- 0.12        # conductance density g_Na
      g_L  <- 0.0003      # conductance density g_L
      v_K  <- 12          # K reversal potential
      v_Na <- -115        # Na reversal potential
      v_L <- -10.5989     # Leak reversal potential
      # Function ion Channel
      ## Rate functions for K activation (variable n)
      alpha_n <- function(v) 0.01*(v+10)/(-1+exp((v+10)/10))
      beta_n  <- function(v) 0.125*exp(v/80)
      ## Rate functions for Na activation (variable m)
      alpha_m <- function(v) 0.1*(v+25)/(-1+exp((v+25)/10))
      beta_m  <- function(v) 4*exp(v/18) 
      # Rate functions for Na inactivation (variable h)
      alpha_h <- function(v) 0.07*exp(v/20)
      beta_h  <- function(v) (1+exp((v+30)/10))^-1
      # Derivatives of ion channel functions/Kinetic equations for channel variables
      dndt <- function(n,v)(alpha_n(v)*(1-n)-beta_n(v)*n)
      dmdt <- function(m,v)(alpha_m(v)*(1-m)-beta_m(v)*m)
      dhdt <- function(h,v)(alpha_h(v)*(1-h)-beta_h(v)*h)
      # Initial conditions  
      ## In the resting state V = 0
      V_0  <- 0
      n_0  <- alpha_n(V_0)/(alpha_n(V_0)+beta_n(V_0))
      m_0  <- alpha_m(V_0)/(alpha_m(V_0)+ beta_m(V_0))
      h_0  <- alpha_h(V_0)/(alpha_h(V_0)+beta_h(V_0))
      vini <- (-15)*exp(-x^2/D_coefficient^2)
      # vini <- -15*sin(pi*x)
      # vini <- rep(-15, N)
      uini  <- rep(0, N)
      nini  <- rep(n_0, N)
      mini  <- rep(m_0, N)
      hini  <- rep(h_0, N)
      yini  <- c(vini, uini, nini, mini, hini)
      # Model equations/Differential equations 
      Pulse_propagation <- function (t, y, parms) {
        v <- y[1:N]
        u <- y[(N+1):(2*N)]
        n <- y[(2*N+1):(3*N)]
        m <- y[(3*N+1):(4*N)]
        h <- y[(4*N+1):(5*N)]
        
        dv <- u
        du <- tran.1D(C = v, C.up = 0, C.down = 0,  D = D_coefficient, dx = xgrid)$dC - 
          (1/tau + (mu/tau)*(g_K*n^4 + g_Na*m^3*h + g_L))*u - 
          g_K*((mu/tau^2)*n^4 + 4*(mu/tau)*n^3*dndt(n,v))*(v-v_K) - 
          g_Na*((mu/tau^2)*m^3*h + (mu/tau)*(3*m^2*h*dmdt(m,v) + m^3*dhdt(h,v)))*(v -v_Na) - 
          g_L*(mu/tau^2)*(v - v_L) + Iinj 
          
        dn <- (alpha_n(v)*(1-n)-beta_n(v)*n)
        dm <- (alpha_m(v)*(1-m)-beta_m(v)*m)
        dh <- (alpha_h(v)*(1-h)-beta_h(v)*h)
        return(list(c(dv,du, dn, dm, dh)))
      }
    

    We’re done building the model. In the model code above, the action potential \(V_m\) is represented by the state variable v, which captures the dynamics of nerve-impulse transmission in the nervous system. We now define the simulation time and run the model to solve equation \((C.8)\).

     # Specify the time at which the output is wanted  
      time_3 <- seq(0,15, 0.01)
    

    The defined model is one-dimensional (one spatial independent variable), so we use the function ode.1D here. The time it takes to solve the model (in seconds) is also printed.

     print(system.time(
        out_3 <- ode.1D(y = yini, func = Pulse_propagation, 
                        times = time_3, method = "adams", parms = NULL, 
                        dimens = N, nspec = 5,  
                        names = c("v", "u", "n", "m", "h"))))
           user      system     elapsed 
           0.21        0.00        0.20 
    
    

    The model output/results “out_3” is a matrix, which contains all the data needed to analyse and visualize the results of the simulation.
    Before plotting the numeric solution of the PDE \((C.8)\), we will first check whether the integration was successful; to get detailed information about the success of our simulation we use the functions diagnostics{deSolve} and summary{deSolve}.

    Diagnostics and Summary

    # Print detailed information about the success of our 
      diagnostics(out_3)
    --------------------
    lsode return code
    --------------------
    
      return code (idid) =  2 
      Integration was successful.
    
    --------------------
    INTEGER values
    --------------------
    
      1 The return code : 2 
      2 The number of steps taken for the problem so far: 1950 
      3 The number of function evaluations for the problem so far: 2120 
      5 The method order last used (successfully): 4 
      6 The order of the method to be attempted on the next step: 4 
      7 If return flag =-4,-5: the largest component in error vector 0 
      8 The length of the real work array actually required: 820 
      9 The length of the integer work array actually required: 20 
     14 The number of Jacobian evaluations and LU decompositions so far: 0 
     
    --------------------
    RSTATE values
    --------------------
    
      1 The step size in t last used (successfully): 0.01 
      2 The step size to be attempted on the next step: 0.01 
      3 The current value of the independent variable which the solver has reached: 15.00627 
      4 Tolerance scale factor > 1.0 computed when requesting too much accuracy: 0 
    
    
      summary(out_3)
    
                        v             u            n            m            h
    Min.    -1.052214e+02 -3.101267e+02 3.176769e-01 1.396179e-02 7.750275e-02
    1st Qu. -2.231145e+00 -1.013699e+00 3.177112e-01 2.289523e-02 2.584720e-01
    Median  -4.608204e-06 -7.997909e-02 3.706068e-01 5.293269e-02 5.108495e-01
    Mean    -8.208256e+00  5.239505e-01 4.551956e-01 1.883197e-01 4.269989e-01
    3rd Qu.  7.670459e+00  4.150809e-06 5.910743e-01 7.744576e-02 5.960693e-01
    Max.     1.204040e+01  7.670492e+01 7.681718e-01 9.940305e-01 5.961208e-01
    N        1.501000e+04  1.501000e+04 1.501000e+04 1.501000e+04 1.501000e+04
    sd       2.720432e+01  3.933037e+01 1.595711e-01 3.139004e-01 1.858227e-01 
    
    
    

    Plot the results

    Time evolution of membrane potential of H&H neuron at various distances along the axon.

    As we can see in the following plot, when an excitable membrane is incorporated into the nonlinear equation \((C.6)\) and we use suitable initial and boundary conditions, the model gives rise to traveling waves of electrical excitation, and the action potential repeats itself at successive locations along the axon. The magnitude and duration of the action potential remain the same throughout the propagation along the neuron’s axon.

     library(plotly)
      library(see)
      library(dplyr)   # needed for filter(), not loaded explicitly in the original
      # Gather columns into key-value pairs
      # (the two assignments below were garbled on the original page; the left-hand
      #  sides and pipes are plausible reconstructions)
      ldata <- data.frame(out_3) %>% tidyr::gather(Shape, Action_Potential, -time) 
      # head(ldata)
      ldata_1_9 <- ldata %>% filter(Shape == (1):(N-1))
      plot_ldata <- ggplot(ldata_1_9, 
             aes(x = time,
                 y = - Action_Potential, 
                 group = Shape,
                 col = Shape
                 )) + labs(title = "Action Potential profiles through time \n at various axon distances") + geom_line() + theme_abyss() # + geom_hline(yintercept = 15, color = "red", lwd = 2)
      ggplotly(plot_ldata) 
    


    Figure 1: Numerical solution of equation \((C.8)\) showing the time course of action potential. The action potential repeats itself at successive locations along the axon.

    3D Plot of the numeric solution

    To visualize the results in 3D the R version of the package{plotly} is used.

     action_potential <- subset(out_3, which = c("v"))
      # The plot_ly() call was garbled on the original page; the surface trace below
      # is a plausible reconstruction
      fig_surface <- plot_ly(x = x, y = time_3, z = action_potential, type = "surface") %>% 
        layout(title = list(text = " 3D Plot of the numeric solution", y = 0.95 ), scene = list(
                xaxis = list(title = "Distance"),
                yaxis = list(title = "Time"),
                zaxis = list(title = "v: Action Potential")
                )
               )
      fig_surface 
    


    Figure 2: 3D Presentation of the numerical solution of equation \((C.8)\).

    The following plots show the time courses of the other dependent variables: n, m, h and dv/dt.

    Plot_all_gatting <- function(x, y, my_title){
        library(plotly)
        library(see)
        library(dplyr)  # needed for filter()
        # (the two assignments below were garbled on the original page; the left-hand
        #  sides and pipes are plausible reconstructions)
        ldata <- data.frame(out_3) %>% tidyr::gather(Shape, variable, -time) 
        # head(ldata)
        ldata_1_9 <- ldata %>% filter(Shape == x:y)
        plot_ldata <- ggplot(ldata_1_9, aes(x = time, y = variable, group = Shape, col = Shape)) + 
          labs(title = my_title) +
          geom_line() + theme_radar_dark()
        plot_ldata
      }
      n <- Plot_all_gatting(2*N+1, 3*N, "n profiles through time \n at various axon distances")
      m <- Plot_all_gatting(3*N+1, 4*N, "m profiles through time \n at various axon distances")
      h <- Plot_all_gatting(4*N+1, 5*N, "h profiles through time \n at various axon distances")
      u <- Plot_all_gatting(N+1, 2*N, "dV/dt profiles through time \n at various axon distances")
      gridExtra::grid.arrange(n, m, h, u, ncol = 2)
    


    Figure 3: Time courses of dv/dt and the gating functions n,m,h.

    3D Plots of action Potential, and the gating variables n,m,h

    dVdt <- subset(out_3, which = c("u"))
      M <-   subset(out_3, which = c("m"))
      N <-   subset(out_3, which = c("n"))
      H <-   subset(out_3, which = c("h"))
      # custom grid style
      axx <- list(
        gridcolor='rgb(255, 255, 255)',
        zerolinecolor='rgb(255, 255, 255)',
        showbackground= TRUE,
        backgroundcolor='rgb(230, 230,230)')
      
      # individual plots
      fig1 <- plot_ly(x = x, y = time_3,  z = - dVdt , scene='scene')
      fig1 <- fig1 %>% add_surface(showscale=FALSE)
      
      fig2 <- plot_ly(x = x, y = time_3,z = M, scene='scene2')
      fig2 <- fig2 %>% add_surface(showscale=FALSE)
      
      fig3 <- plot_ly(x = x, y = time_3,z = N, scene='scene3')
      fig3 <- fig3 %>% add_surface(showscale=FALSE)
      
      fig4 <- plot_ly(x = x, y = time_3,z = H, scene='scene4')
      fig4 <- fig4 %>% add_surface(showscale=FALSE)
      
      # subplot and define scene
      fig <- subplot(fig1, fig2, fig3, fig4) 
      fig <- fig %>% layout(title = "3D Plots of the numeric solution for dv/dt, n,m and h", 
                            scene = list(domain=list(x=c(0,0.5),y=c(0.5,1)),
                                         xaxis = c(axx, list(title = "Distance")),
                                         yaxis = c(axx, list(title = "Time")),
                                         zaxis = c(axx, list(title = "dv/dt")),
                                         aspectmode='cube'),
                            scene2 = list(domain=list(x=c(0.5,1),y=c(0.5,1)),
                                          xaxis = c(axx, list(title = "Distance")),
                                          yaxis = c(axx, list(title = "Time")),
                                          zaxis = c(axx, list(title = "Gating \n variable m")),
                                          aspectmode='cube'),
                            scene3 = list(domain=list(x=c(0,0.5),y=c(0,0.5)),
                                          xaxis = c(axx, list(title = "Distance")),
                                          yaxis = c(axx, list(title = "Time")),
                                          zaxis = c(axx, list(title = "Gating \n variable n")),
                                          aspectmode='cube'),
                            scene4 = list(domain=list(x=c(0.5,1),y=c(0,0.5)),
                                          xaxis = c(axx, list(title = "Distance")),
                                          yaxis = c(axx, list(title = "Time")),
                                          zaxis = c(axx, list(title = "Gating \n variable h")),
                                          aspectmode='cube'))
      
      
      fig
    


    Fig 4: 3D representation of the numerical solution of dv/dt, n, m and h.
    and finally the contour plot

    Contour Plot

     action_potential <- subset(out_3, which = "v")
      fig_prop <- plot_ly(visible = TRUE, 
                     x = x ,
                     y = time_3, 
                     z = action_potential, 
                     type = "contour", contours = list(showlabels = TRUE))
      fig_prop <- fig_prop %>% colorbar(title = "The independent \n variable") %>% 
        layout(title = list(text = "Variable density", y = 0.98), xaxis = list(title = "Distance "), yaxis = list(title =  "Time" ))
      fig_prop 
    

    Shiny application

    Using my new Shiny application (not yet published; it is still under development), one can explore the spatiotemporal dynamics of the action potential and the ionic currents. Just to show an example: when we change the initial value of the action potential in the Shiny application to -65 mV (instead of -15 mV in the model above), the model produces more than one spike and generates a periodic response. With this app one can also study the influence of the axon-specific self-inductance on the spatiotemporal dynamics of the action potential and its properties (all-or-none rule, …). If we change the initial value of the action potential in the Shiny application and then just click the update button, we get the results shown below:
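
    Even without the Shiny app, the same experiment can be reproduced by changing the initial condition and re-running the solver. A minimal sketch reusing the objects defined above (note that N was reused as a plot object in the 3D-plot chunk, so the grid size is taken from xgrid instead):

    # Re-run the simulation with a stronger initial depolarisation (-65 instead of -15)
    vini_65 <- (-65) * exp(-xgrid$x.mid^2 / D_coefficient^2)
    yini_65 <- c(vini_65, uini, nini, mini, hini)
    out_65  <- ode.1D(y = yini_65, func = Pulse_propagation,
                      times = time_3, method = "adams", parms = NULL,
                      dimens = xgrid$N, nspec = 5,
                      names = c("v", "u", "n", "m", "h"))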


    Summary and Outlook

    In this blog post we solved a one-dimensional PDE using only R packages; this hyperbolic PDE describes the dynamics of the action potential along an axon. Our numerical solution reproduces the well-known result, namely that the action potential propagates along the axon without changing its magnitude or shape.
    Next I will investigate a more realistic model, the two-dimensional nerve pulse propagation model.

    Contact:

    kritiker2017@gmail.com.

    Ditch p-values. Use Bootstrap confidence intervals instead

    P-values don’t mean what people think they mean; they rely on hidden assumptions that are unlikely to be fulfilled; they detract from the real questions. Here’s how to use the Bootstrap in R instead

    This post was first published on Towards Data Science and is based on my book “Behavioral Data Analysis with R and Python”

    A few years ago, I was hired by one of the largest insurance companies in the US to start and lead their behavioral science team. I had a PhD in behavioral economics from one of the top 10 economics departments in the world and half a decade of experience as a strategy consultant, so I was confident my team would be able to drive business decisions through well-crafted experiments and data analyses.
    And indeed, I worked with highly-skilled data scientists who had a very sharp understanding of statistics. But after years of designing and analyzing experiments, I grew dissatisfied with the way we communicated results to decision-makers. I felt that the over-reliance on p-values led to sub-optimal decisions. After talking to colleagues in other companies, I realized that this was a broader problem, and I set out to write a guide to better data analysis. In this article, I’ll present one of the biggest recommendations of the book, which is to ditch p-values and use Bootstrap confidence intervals instead.

    Ditch p-values

    There are many reasons why you should abandon p-values, and I’ll examine three of the main ones here:

    • They don’t mean what people think they mean
    • They rely on hidden assumptions that are unlikely to be fulfilled
    • They detract from the real questions

    1. They don’t mean what people think they mean

    When you’re running applied data analyses, whether in the private, non-profit or public sectors, your goal is to inform decisions. A big part of the problem we need to solve is uncertainty: the data tells us that the number is 10, but could it be 5 instead? Maybe 100? Can we rely on the number that the data analysis spouted? After years, sometimes decades, of educating business partners on the matter, they generally understand the risks of uncertainty. Unfortunately, they often jump from there to assuming that the p-value represents the probability that a result is due to chance: a p-value of 0.03 would mean that there’s a 3% chance a number we thought was positive is actually null or negative. It does not. In fact, it represents the probability of seeing the result we saw assuming that the real value is indeed zero.
    In scientific jargon, the real value being zero or negative is called the null hypothesis (abbreviated H0), and the real value being strictly above zero is called the alternative hypothesis (abbreviated H1). People mistakenly believe that the p-value is the probability of H0 given the data, P(H0|data), when in reality it is the probability of the data given H0, P(data|H0). You may be thinking: potato po-tah-to, that’s hair splitting and a very small p-value is indeed a good sign that a result is not due to chance. In many circumstances, you’ll be approximately correct, but in some cases, you’ll be utterly wrong.
    Let’s take a simplified but revealing example: we want to determine Robert’s citizenship. Null hypothesis: H0, Robert is a US citizen. Alternative hypothesis: H1, he is not. Our data: we know that Robert is a US senator. There are 100 senators out of 330 million US citizens, so under the null hypothesis, the probability of our data (i.e., the p-value) is 100 / 330,000,000 ≈ 0.000000303. Per the rules of statistical significance, we can safely conclude that our null hypothesis is rejected and Robert is not a US citizen. That’s obviously false, so what went wrong? The probability that Robert is a US senator is indeed very low if he is a US citizen, but it’s even lower if he is not (namely zero!). P-values cannot help us here, even with a stricter 0.01 or even 0.001 threshold (for an alternative illustration of this problem, see xkcd).
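    To make the distinction concrete, here is a minimal sketch of the Bayes' rule arithmetic behind the senator example. The prior P(H0) below is an assumption chosen purely for illustration; any prior strictly between 0 and 1 leads to the same conclusion.

    p_data_given_h0 <- 100 / 330e6  # P(senator | citizen): the "p-value", about 0.000000303
    p_data_given_h1 <- 0            # P(senator | not a citizen): senators must be citizens
    p_h0 <- 0.5                     # assumed prior P(H0), purely illustrative
    
    # Bayes' rule: P(H0 | data)
    p_h0_given_data <- p_data_given_h0 * p_h0 /
      (p_data_given_h0 * p_h0 + p_data_given_h1 * (1 - p_h0))
    p_h0_given_data  # equals 1: given the data, Robert is certainly a citizen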

    2. They rely on hidden assumptions

    P-values were invented at a time when all calculations had to be done by hand, and so they rely on simplifying statistical assumptions. Broadly speaking, they assume that the phenomenon you’re observing obeys some regular statistical distribution, often the normal distribution (or a distribution from the same family). Unfortunately, that’s rarely true²:

    • Unless you’re measuring some low-level psycho-physiological variable, your population of interest is generally made up of heterogeneous groups. Let’s say you’re a marketing manager for Impossible Burgers looking at the demand for meat substitutes. You would have to account for two groups: on the one hand, vegetarians, for whom the relevant alternative is a different vegetarian product; on the other hand, meat eaters, who can be enticed but will care much more about taste and price compared to meat itself.
    • A normal distribution is symmetrical and extends to infinity in both directions. In real life, there are asymmetries, thresholds and limits. People never buy negative quantities, nor infinite ones. They don’t vote at all before they’re 18, and the market for 120-year-olds is much narrower than the markets for 90-year-olds and 60-year-olds would suggest.
    • Conversely, we see “fat-tailed” distributions, where extreme values are much more frequent than expected from a normal distribution. There are more multi-billionaires than you would expect from looking at the number of millionaires and billionaires.

    This implies that the p-values coming from a standard model are often wrong. Even if you correctly treat them as P(data|H0) and not P(H0|data), they’ll often be significantly off.

    3. They detract from the real questions

    Let’s say that you have taken to heart the two previous issues and built a complete Bayesian model that finally allows you to properly calculate P(H0|data), the probability that the real value is zero or negative given the observed data. Should you bring it to your decision-maker? I would argue that you shouldn’t, because it doesn’t reflect economic outcomes.
    Let’s say that a business decision-maker is pondering two possible actions, A and B. Based on observed data, the probability of zero or negative benefits is:

    • 0.08 for action A
    • 0.001 for action B

    Should the decision-maker pick action B based on these numbers? What if I told you that the corresponding 90% confidence intervals are:

    • [-$0.5m; $99.5m] for action A
    • [$0.1m; $0.2m] for action B

    Action B may have a lower probability of leading to a zero or negative outcome, but its expected value for the business is much lower, unless the business is incredibly risk-averse. In most situations, “economic significance” for a decision-maker hangs on two questions:
    • How much money are we likely to gain? (aka, the expected value)
    • In a “reasonably-likely worst-case scenario”, how much money do we stand to lose? (aka, the lower bound of the confidence interval)
    Confidence intervals are a much better tool to answer these questions than a single probability number.

    Use the Bootstrap instead

    Let’s take a concrete example, adapted and simplified from my book¹. A company has executed a time study of how long it takes its bakers to prepare made-to-order cakes depending on their level of experience. Having an industrial engineer measure how long it takes to make a cake in various stores across the country is expensive and time-consuming, so the data set has only 10 points, as you can see in the following figure.
    Scatterplot of baking time vs months of experience

    In addition to the very small size of the sample, it contains an outlier, in the upper left corner: a new employee spending most of a day working on a complex cake for a corporate retreat. How should the data be reported to business partners? One could discard the extreme observation. But that observation, while unusual, is not an aberration per se. There was no measurement error, and those circumstances probably occur from time to time. Another option would be to only report the overall mean duration, 56 minutes, but that would also be misleading because it would not convey the variability and uncertainty in the data.
    Alternatively, one could calculate a normal confidence interval (CI) based on the traditional statistical assumptions. Normal confidence intervals are closely linked to the p-value: an estimate is statistically significant if and only if the corresponding confidence interval does not include zero. As you’ll learn in any stats class, the lower limit of a normal 95%-CI is equal to the mean minus 1.96 times the standard error, and the upper limit is equal to the mean plus 1.96 times the standard error. Unfortunately, in the present case the confidence interval is [-23;135] and we can imagine that business partners would not take too kindly to the possibility of negative baking duration…
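    For reference, here is a minimal sketch of that normal CI calculation, assuming the 10 observed durations (in minutes) are stored in dat$times, the column used later in the Bootstrap code:

    mean_times <- mean(dat$times)
    se_times <- sd(dat$times) / sqrt(nrow(dat))   # standard error of the mean
    c(lower = mean_times - 1.96 * se_times,
      upper = mean_times + 1.96 * se_times)       # roughly [-23; 135] for the book's data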
    This issue comes from the assumption that baking times are normally distributed, which they are obviously not. One could try to fit a better distribution, but using a Bootstrap confidence interval is much simpler.

    1. The Bootstrap works by drawing with replacement
    To build Bootstrap confidence intervals, you simply need to build “a lot of similar samples” by drawing with replacement from your original sample. Drawing with replacement is very simple in R, we just set “replace” to true:
    boot_dat <- slice_sample(dat, n = nrow(dat), replace = TRUE) # slice_sample() comes from dplyr
    Why are we drawing with replacement? To really grasp what’s happening with the Bootstrap, it’s worth taking a step back and remembering the point of statistics: we have a population that we cannot fully examine, so we’re trying to make inferences about this population based on a limited sample. To do so, we create an “imaginary” population through either statistical assumptions or the Bootstrap. With statistical assumptions we say, “imagine that this sample is drawn from a population with a normal distribution,” and then make inferences based on that. With the Bootstrap we’re saying, “imagine that the population has exactly the same probability distribution as the sample,” or equivalently, “imagine that the sample is drawn from a population made of a large (or infinite) number of copies of that sample.” Then drawing with replacement from that sample is equivalent to drawing without replacement from that imaginary population. As statisticians will say, “the Bootstrap sample is to the sample what the sample is to the population.”

    We repeat that process many times (let’s say 2,000 times):
    mean_lst <- list()
    B <- 2000           # number of Bootstrap samples
    N <- nrow(dat)
    for(i in 1:B){
      boot_dat <- slice_sample(dat, n = N, replace = TRUE)  # resample with replacement
      M <- mean(boot_dat$times)                             # mean of this Bootstrap sample
      mean_lst[[i]] <- M
    }
    mean_summ <- tibble(means = unlist(mean_lst))

    The result of the procedure is a list of Bootstrap sample means, which we can plot with a histogram:
    Histogram of Bootstrap sample means
    As you can see, the histogram is very irregular: there is a peak close to the mean of our original data set along with smaller peaks corresponding to certain patterns. Given how extreme our outlier is, each of the seven peaks corresponds to its number of repetitions in the Bootstrap sample, from zero to six. In other words, it doesn’t appear at all in the samples whose means are in the first (leftmost) peak, it appears exactly once in the samples whose means are in the second peak, and so on.
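    You can check this interpretation directly by counting how many times the outlier appears in each Bootstrap sample. A quick sketch, assuming the outlier is the largest observed duration (this is a separate resampling run, so the exact counts will differ slightly from the histogram above):

    outlier_value <- max(dat$times)          # assumption: the outlier is the longest duration
    outlier_count <- integer(B)
    for(i in 1:B){
      boot_dat <- slice_sample(dat, n = N, replace = TRUE)
      outlier_count[i] <- sum(boot_dat$times == outlier_value)
    }
    table(outlier_count)   # counts from 0 up to ~6, one per peak from left to right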

    2. Calculating the confidence interval from the Bootstrap samples

    From there, we can calculate the Bootstrap confidence interval (CI). The bounds of the CI are determined from the empirical distribution of the preceding means. Instead of trying to fit a statistical distribution (e.g., normal), we can simply order the values from smallest to largest and then look at the 2.5% quantile and the 97.5% quantile to find the two-tailed 95%-CI. With 2,000 samples, the 2.5% quantile is equal to the value of the 50th smallest mean (because 2,000 * 0.025 = 50), and the 97.5% quantile is equal to the value of the 1950th mean from smallest to largest, or the 50th largest mean (because both tails have the same number of values). Fortunately, we don’t have to calculate these by hand:
    LL_b <- as.numeric(quantile(mean_summ$means, c(0.025)))
    UL_b <- as.numeric(quantile(mean_summ$means, c(0.975)))

    The following figure shows the same histogram as before but adds the mean of the means, the normal CI bounds and the Bootstrap CI bounds.

    Histogram of Bootstrap sample means

    Distribution of the means of 2,000 samples, with mean of the means (thick blue line), normal 95%-CI bounds (dotted black lines), and Bootstrap CI bounds (dashed blue lines)
    The Bootstrap 95%-CI is [7.50; 140.80] (plus or minus some sampling difference), which is much more realistic. No negative values as with the normal assumptions!

    3. The Bootstrap is more flexible and relevant for business decisions

    Once you start using the Bootstrap, you’ll be amazed at its flexibility. Small sample size, irregular distributions, business rules, expected values, A/B tests with clustered groups: the Bootstrap can do it all!
    Let’s imagine, for example, that the company in our previous example is considering instituting a time promise - ”your cake in three hours or 50% off” - and wants to know how often a cake currently takes more than three hours to be baked. Our estimate would be the sample percentage: it happens in 1 of the 10 observed cases, or 10%. But we can’t leave it at that, because there is significant uncertainty around that estimate, which we need to convey. Ten percent out of 10 observations is much more uncertain than 10% out of 100 or 1,000 observations. So how could we build a CI around that 10% value? With the Bootstrap, of course!
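    The mechanics are identical to the mean: resample, compute the share of cakes that took more than three hours in each Bootstrap sample, and take the empirical quantiles. A minimal sketch, assuming durations are in minutes in dat$times:

    share_lst <- list()
    for(i in 1:B){
      boot_dat <- slice_sample(dat, n = N, replace = TRUE)
      share_lst[[i]] <- mean(boot_dat$times > 180)   # share of cakes over 3 hours
    }
    shares <- unlist(share_lst)
    quantile(shares, c(0.025, 0.975))                # Bootstrap 95%-CI around the 10% estimate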
    The histogram of Bootstrap estimates also offers a great visualization to show business partners: “This vertical line is the result of the actual measurement, but you can also see all the possible values it could have taken”. The 90% or 95% lower bound offers a “reasonable worst case scenario” based on available information.
    Finally, if your boss or business partners are dead set on p-values, the Bootstrap offers a similar metric, the Achieved Significance Level (ASL). The ASL is simply the percentage of Bootstrap values that are zero or less. That interpretation is very close to the one people wrongly assign to the p-value, so there’s very limited education needed: “the ASL is 3% so there’s a 3% chance that the true value is zero or less; the ASL is less than 0.05 so we can treat this result as significant”.
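    Computing the ASL from a vector of Bootstrap estimates is a one-liner. Here is a sketch using the means from the earlier example (for a baking-time mean the ASL is of course essentially zero; the metric becomes interesting for effect estimates such as A/B test lifts):

    ASL <- mean(mean_summ$means <= 0)   # share of Bootstrap estimates at or below zero
    ASL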

    Conclusion

    To recap, even though p-values remain ubiquitous in applied data analysis, they have overstayed their welcome. They don’t mean what people generally think they mean; and even if they did, that’s not the answer that decision-makers are looking for. Even in academia, there’s currently a strong push towards reducing the blind reliance on p-values (see the ASA statement below⁴).
    Using Bootstrap confidence intervals is both easier and more compelling. They don’t rely on hidden statistical assumptions, only on a straightforward one: the overall population looks the same as our sample. They provide information on possible outcomes that is richer and more relevant to business decisions.

    But wait, there’s more!

    Here comes the final shameless plug. If you want to learn more about the Bootstrap, my book will show you:

    • How to determine the number of Bootstrap samples to draw;
    • How to apply the Bootstrap to regression, A/B test and ad-hoc statistics;
    • How to write high-performance code that will not have your colleagues snickering at your FOR loops (a small taste is sketched right after this list);
    • And a lot of other cool things about analyzing customer data in business.
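    As a small taste of the high-performance point, the FOR loop above can be replaced by drawing all the resampled indices at once. This is a generic base-R sketch, not the book's implementation:

    # Same 2,000 Bootstrap means without an explicit loop
    idx <- matrix(sample(N, size = N * B, replace = TRUE), nrow = B)
    boot_means <- apply(idx, 1, function(i) mean(dat$times[i]))
    quantile(boot_means, c(0.025, 0.975))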

    Cover of book "Behavioral Data Analysis with R and Python"

    References

    [1] F. Buisson, Behavioral Data Analysis with R and Python (2021). My book, obviously! The code for the example is available on the book’s Github repo.
    [2] R. Wilcox, Fundamentals of Modern Statistical Methods: Substantially Improving Power and Accuracy (2010). We routinely assume that our data is normally distributed “enough.” Wilcox shows that this is unwarranted and can severely bias analyses. A very readable book on an advanced topic.
    [3] See my earlier post on Medium “Is Your Behavioral Data Truly Behavioral?”.
    [4] R. L. Wasserstein & N. A. Lazar (2016) “The ASA Statement on p-Values: Context, Process, and Purpose”, The American Statistician, 70:2, 129–133, DOI:10.1080/00031305.2016.1154108.

    Real Plug-and-Play Supervised Learning AutoML using R and lares

    The lares package has multiple families of functions to help the analyst or data scientist achieve quality robust analysis without the need of much coding. One of the most complex but valuable functions we have is h2o_automl, which semi-automatically runs the whole pipeline of a Machine Learning model given a dataset and some customizable parameters. AutoML enables you to train high-quality models specific to your needs and accelerate the research and development process.

    HELP: Before getting to the code, I recommend checking h2o_automl's full documentation here or within your R session by running ?lares::h2o_automl. In it you'll find a brief description of all the parameters you can set in the function to get exactly what you need and control how it behaves.

    Pipeline

    In short, these are some of the things that happen on its backend:

    Mapping `h2o_automl`

    1. Input a dataframe df and choose which one is the independent variable (y) you'd like to predict. You may set/change the seed argument to guarantee reproducibility of your results.

    2. The function decides whether it is a classification (categorical) or regression (continuous) model by looking at the independent variable's (y) class and number of unique values, which can be controlled with the thresh parameter.

    3. The dataframe will be split in two: test and train datasets. The proportion of this split can be controlled with the split argument. This can be replicated with the msplit() function.

    4. You could also center and scale your numerical values before you continue, use no_outliers to exclude some outliers, and/or impute missing values with MICE. If it's a classification model, the function can balance (under-sample) your training data. You can control this behavior with the balance argument. Up to this point, you can replicate the whole process with the model_preprocess() function.

    5. Runs h2o::h2o.automl(...) to train multiple models and generate a leaderboard with the top (max_models or max_time) models trained, sorted by their performance. You can also customize some additional arguments such as nfolds for k-fold cross-validations, exclude_algos and include_algos to exclude or include some algorithms, and any other additional argument you wish to pass to the mother function.

    6. The best model given the default performance metric (which can be changed with stopping_metric parameter) evaluated with cross-validation (customize it with nfolds), will be selected to continue. You can also use the function h2o_selectmodel() to select another model and recalculate/plot everything again using this alternate model.

    7. Performance metrics and plots will be calculated and rendered given the test predictions and test actual values (which were NOT passed to the models as inputs to be trained with). That way, your model's performance metrics shouldn't be biased. You can replicate these calculations with the model_metrics() function.

    8. A list with all the inputs, leaderboard results, best selected model, performance metrics, and plots. You can either (play) see the results on console or export them using the export_results() function.

    Load the library

    Now, let's (install and) load the library, the data, and dig in:

    # install.packages("lares")
    library(lares)
    
    # The data we'll use is the Titanic dataset
    data(dft)
    df <- subset(dft, select = -c(Ticket, PassengerId, Cabin))
    

    NOTE: I'll randomly set some parameters in each example to give visibility to some of the arguments you can pass to your models. Be sure to also check all the printed output, warnings, and messages shown throughout the process, as they may contain relevant information regarding your inputs and the backend operations.

    Modeling examples

    Let's have a look at three specific examples: classification models (binary and multiple categories) and a regression model. Also, let's see how we can export our models and put them to work on any environment.

    Classification: Binary

    Let's begin with a binary (TRUE/FALSE) model to predict if each passenger Survived:

    r <- h2o_automl(df, y = Survived, max_models = 1, impute = FALSE, target = "TRUE")
    #> 2021-06-25 09:49:03 | Started process...
    #> - INDEPENDENT VARIABLE: Survived
    #> - MODEL TYPE: Classification
    #> # A tibble: 2 x 5
    #>   tag       n     p order  pcum
    #>   <lgl> <int> <dbl> <int> <dbl>
    #> 1 FALSE   549  61.6     1  61.6
    #> 2 TRUE    342  38.4     2 100
    #> - MISSINGS: The following variables contain missing observations: Age (19.87%). Consider using the impute parameter.
    #> - CATEGORICALS: There are 3 non-numerical features. Consider using ohse() or equivalent prior to encode categorical variables.
    #> >>> Splitting data: train = 0.7 & test = 0.3
    #> train_size  test_size 
    #>        623        268
    #> - REPEATED: There were 65 repeated rows which are being suppressed from the train dataset
    #> - ALGORITHMS: excluded 'StackedEnsemble', 'DeepLearning'
    #> - CACHE: Previous models are not being erased. You may use 'start_clean' [clear] or 'project_name' [join]
    #> - UI: You may check results using H2O Flow's interactive platform: http://localhost:54321/flow/index.html
    #> >>> Iterating until 1 models or 600 seconds...
    #> 
      |                                                                                                 
      |                                                                                           |   0%
      |                                                                                                 
      |===========================================================================================| 100%
    #> - EUREKA: Succesfully generated 1 models
    #>                           model_id       auc   logloss     aucpr mean_per_class_error      rmse
    #> 1 XGBoost_1_AutoML_20210625_094904 0.8567069 0.4392284 0.8310891            0.2060487 0.3718377
    #>         mse
    #> 1 0.1382633
    #> SELECTED MODEL: XGBoost_1_AutoML_20210625_094904
    #> - NOTE: The following variables were the least important: Embarked.S, Pclass.2, Embarked.C
    #> >>> Running predictions for Survived...
    #> Target value: TRUE
    #> >>> Generating plots...
    #> Model (1/1): XGBoost_1_AutoML_20210625_094904
    #> Independent Variable: Survived
    #> Type: Classification (2 classes)
    #> Algorithm: XGBOOST
    #> Split: 70% training data (of 891 observations)
    #> Seed: 0
    #> 
    #> Test metrics: 
    #>    AUC = 0.87654
    #>    ACC = 0.17164
    #>    PRC = 0.18421
    #>    TPR = 0.34314
    #>    TNR = 0.066265
    #> 
    #> Most important variables:
    #>    Sex.female (29.2%)
    #>    Fare (26.0%)
    #>    Age (20.5%)
    #>    Pclass.3 (8.3%)
    #>    Sex.male (4.1%)
    #> Process duration: 7.86s
    

    Let's take a look at the plots generated into a single dashboard:

    plot(r)
    

    plot of chunk unnamed-chunk-1

    We also have several calculations for our model's performance that may come in useful, such as a confusion matrix, gain and lift by percentile, area under the curve (AUC), accuracy (ACC), recall or true positive rate (TPR), cross-validation metrics, exact thresholds to maximize each metric, and others:

    r$metrics
    #> $dictionary
    #> [1] "AUC: Area Under the Curve"                                                             
    #> [2] "ACC: Accuracy"                                                                         
    #> [3] "PRC: Precision = Positive Predictive Value"                                            
    #> [4] "TPR: Sensitivity = Recall = Hit rate = True Positive Rate"                             
    #> [5] "TNR: Specificity = Selectivity = True Negative Rate"                                   
    #> [6] "Logloss (Error): Logarithmic loss [Neutral classification: 0.69315]"                   
    #> [7] "Gain: When best n deciles selected, what % of the real target observations are picked?"
    #> [8] "Lift: When best n deciles selected, how much better than random is?"                   
    #> 
    #> $confusion_matrix
    #>        Pred
    #> Real    FALSE TRUE
    #>   FALSE    11  155
    #>   TRUE     67   35
    #> 
    #> $gain_lift
    #> # A tibble: 10 x 10
    #>    percentile value random target total  gain optimal   lift response score
    #>    <fct>      <chr>  <dbl>  <int> <int> <dbl>   <dbl>  <dbl>    <dbl> <dbl>
    #>  1 1          TRUE    10.1     25    27  24.5    26.5 143.     24.5   95.6 
    #>  2 2          TRUE    20.5     25    28  49.0    53.9 139.     24.5   84.6 
    #>  3 3          TRUE    30.2     19    26  67.6    79.4 124.     18.6   47.8 
    #>  4 4          TRUE    40.3     12    27  79.4   100    97.1    11.8   29.5 
    #>  5 5          TRUE    50        7    26  86.3   100    72.5     6.86  20.7 
    #>  6 6          TRUE    60.1      4    27  90.2   100    50.1     3.92  14.3 
    #>  7 7          TRUE    70.1      4    27  94.1   100    34.2     3.92   9.59
    #>  8 8          TRUE    79.9      2    26  96.1   100    20.3     1.96   7.58
    #>  9 9          TRUE    89.9      1    27  97.1   100     7.93    0.980  5.89
    #> 10 10         TRUE   100        3    27 100     100     0       2.94   3.20
    #> 
    #> $metrics
    #>       AUC     ACC     PRC     TPR      TNR
    #> 1 0.87654 0.17164 0.18421 0.34314 0.066265
    #> 
    #> $cv_metrics
    #> # A tibble: 20 x 8
    #>    metric                    mean     sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
    #>    <chr>                    <dbl>  <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
    #>  1 accuracy                 0.831 0.0539      0.84       0.816      0.856      0.895      0.75 
    #>  2 auc                      0.856 0.0561      0.906      0.787      0.894      0.889      0.805
    #>  3 err                      0.169 0.0539      0.16       0.184      0.144      0.105      0.25 
    #>  4 err_count               21     6.67       20         23         18         13         31    
    #>  5 f0point5                 0.788 0.0958      0.788      0.745      0.846      0.905      0.654
    #>  6 f1                       0.777 0.0764      0.821      0.676      0.827      0.847      0.716
    #>  7 f2                       0.774 0.0911      0.858      0.619      0.808      0.796      0.789
    #>  8 lift_top_group           2.62  0.287       2.40       3.05       2.31       2.64       2.70 
    #>  9 logloss                  0.439 0.0670      0.376      0.491      0.406      0.395      0.529
    #> 10 max_per_class_error      0.270 0.0924      0.192      0.415      0.204      0.234      0.308
    #> 11 mcc                      0.651 0.105       0.684      0.565      0.705      0.779      0.522
    #> 12 mean_per_class_accuracy  0.818 0.0512      0.846      0.757      0.849      0.870      0.770
    #> 13 mean_per_class_error     0.182 0.0512      0.154      0.243      0.151      0.130      0.230
    #> 14 mse                      0.138 0.0264      0.114      0.156      0.126      0.120      0.176
    #> 15 pr_auc                   0.827 0.0837      0.895      0.744      0.886      0.884      0.727
    #> 16 precision                0.799 0.122       0.767      0.8        0.86       0.947      0.619
    #> 17 r2                       0.410 0.130       0.531      0.293      0.486      0.491      0.247
    #> 18 recall                   0.776 0.116       0.885      0.585      0.796      0.766      0.848
    #> 19 rmse                     0.371 0.0349      0.338      0.395      0.355      0.346      0.419
    #> 20 specificity              0.861 0.112       0.808      0.929      0.901      0.974      0.692
    #> 
    #> $max_metrics
    #>                         metric  threshold       value idx
    #> 1                       max f1 0.28890845   0.7490637 224
    #> 2                       max f2 0.21783681   0.8062016 252
    #> 3                 max f0point5 0.64448303   0.8105023 111
    #> 4                 max accuracy 0.61486661   0.8170144 117
    #> 5                max precision 0.99179381   1.0000000   0
    #> 6                   max recall 0.02130460   1.0000000 399
    #> 7              max specificity 0.99179381   1.0000000   0
    #> 8             max absolute_mcc 0.61486661   0.6115356 117
    #> 9   max min_per_class_accuracy 0.33269805   0.7859008 207
    #> 10 max mean_per_class_accuracy 0.31330019   0.7939785 214
    #> 11                     max tns 0.99179381 383.0000000   0
    #> 12                     max fns 0.99179381 239.0000000   0
    #> 13                     max fps 0.03076078 383.0000000 398
    #> 14                     max tps 0.02130460 240.0000000 399
    #> 15                     max tnr 0.99179381   1.0000000   0
    #> 16                     max fnr 0.99179381   0.9958333   0
    #> 17                     max fpr 0.03076078   1.0000000 398
    #> 18                     max tpr 0.02130460   1.0000000 399
    

    The same goes for the plots generated for these metrics. We have the gains and response plots on the test dataset, the confusion matrix, and ROC curves.

    r$plots$metrics
    #> $gains
    #> Warning: Removed 1 rows containing missing values (geom_label).
    

    plot of chunk unnamed-chunk-3

    #> 
    #> $response
    

    plot of chunk unnamed-chunk-3

    #> 
    #> $conf_matrix
    

    plot of chunk unnamed-chunk-3

    #> 
    #> $ROC
    

    plot of chunk unnamed-chunk-3

    For all models, regardless of their type (classification or regression), you can check the importance of each variable as well:

    head(r$importance)
    #>     variable relative_importance scaled_importance importance
    #> 1 Sex.female           205.62099         1.0000000 0.29225814
    #> 2       Fare           182.91312         0.8895644 0.25998245
    #> 3        Age           144.42017         0.7023610 0.20527073
    #> 4   Pclass.3            58.04853         0.2823084 0.08250692
    #> 5   Sex.male            29.17109         0.1418683 0.04146216
    #> 6      Parch            28.74764         0.1398089 0.04086028
    
    r$plots$importance
    

    plot of chunk unnamed-chunk-4

    Classification: Multi-Categorical

    Now, let's run a multi-categorical (+2 labels) model to predict Pclass of each passenger:

    r <- h2o_automl(df, Pclass, ignore = c("Fare", "Cabin"), max_time = 30, plots = FALSE)
    #> 2021-06-25 09:49:36 | Started process...
    #> - INDEPENDENT VARIABLE: Pclass
    #> - MODEL TYPE: Classification
    #> # A tibble: 3 x 5
    #>   tag       n     p order  pcum
    #>   <fct> <int> <dbl> <int> <dbl>
    #> 1 n_3     491  55.1     1  55.1
    #> 2 n_1     216  24.2     2  79.4
    #> 3 n_2     184  20.6     3 100
    #> - MISSINGS: The following variables contain missing observations: Age (19.87%). Consider using the impute parameter.
    #> - CATEGORICALS: There are 3 non-numerical features. Consider using ohse() or equivalent prior to encode categorical variables.
    #> >>> Splitting data: train = 0.7 & test = 0.3
    #> train_size  test_size 
    #>        623        268
    #> - REPEATED: There were 65 repeated rows which are being suppressed from the train dataset
    #> - ALGORITHMS: excluded 'StackedEnsemble', 'DeepLearning'
    #> - CACHE: Previous models are not being erased. You may use 'start_clean' [clear] or 'project_name' [join]
    #> - UI: You may check results using H2O Flow's interactive platform: http://localhost:54321/flow/index.html
    #> >>> Iterating until 3 models or 30 seconds...
    #> 
      |                                                                                                 
      |                                                                                           |   0%
      |                                                                                                 
      |====                                                                                       |   4%
      |                                                                                                 
      |========                                                                                   |   9%
      |                                                                                                 
      |===========================================================================================| 100%
    #> - EUREKA: Succesfully generated 3 models
    #>                           model_id mean_per_class_error   logloss      rmse       mse auc aucpr
    #> 1 XGBoost_2_AutoML_20210625_094936            0.4764622 0.8245831 0.5384408 0.2899185 NaN   NaN
    #> 2 XGBoost_1_AutoML_20210625_094936            0.4861030 0.8451478 0.5422224 0.2940051 NaN   NaN
    #> 3 XGBoost_3_AutoML_20210625_094936            0.4904451 0.8522329 0.5440560 0.2959969 NaN   NaN
    #> SELECTED MODEL: XGBoost_2_AutoML_20210625_094936
    #> - NOTE: The following variables were the least important: Sex.male, Embarked.Q
    #> >>> Running predictions for Pclass...
    #> Model (1/3): XGBoost_2_AutoML_20210625_094936
    #> Independent Variable: Pclass
    #> Type: Classification (3 classes)
    #> Algorithm: XGBOOST
    #> Split: 70% training data (of 891 observations)
    #> Seed: 0
    #> 
    #> Test metrics: 
    #>    AUC = 0.78078
    #>    ACC = 0.68284
    #> 
    #> Most important variables:
    #>    Age (52.0%)
    #>    Survived.FALSE (11.9%)
    #>    Embarked.C (7.2%)
    #>    Survived.TRUE (5.4%)
    #>    Sex.female (5.4%)
    #> Process duration: 9.79s
    

    Let's take a look at the plots generated into a single dashboard:

    plot(r)
    

    plot of chunk unnamed-chunk-5

    Regression

    Finally, a regression model with continuous values to predict the Fare paid by each passenger:

    r <- h2o_automl(df, y = "Fare", ignore = "Pclass", exclude_algos = NULL, quiet = TRUE)
    print(r)
    #> Model (1/4): StackedEnsemble_AllModels_AutoML_20210625_094950
    #> Independent Variable: Fare
    #> Type: Regression
    #> Algorithm: STACKEDENSEMBLE
    #> Split: 70% training data (of 871 observations)
    #> Seed: 0
    #> 
    #> Test metrics: 
    #>    rmse = 20.309
    #>    mae = 14.244
    #>    mape = 0.07304
    #>    mse = 412.45
    #>    rsq = 0.3169
    #>    rsqa = 0.3143
    

    Let's take a look at the plots generated into a single dashboard:

    plot(r)
    

    plot of chunk unnamed-chunk-6

    Export models and results

    Once you have your model trained and picked, you can export the model and its results so you can put it to work in a production environment (which doesn't have to be R). There is a function that does all that for you: export_results(). Simply pass your h2o_automl list object into this function and that's it! You can select which formats will be exported using the which argument. Currently we support: txt, csv, rds, binary, mojo [best format for production], and plots. There are also 2 quick options (dev and production) to export some or all the files. Lastly, you can set a custom subdir to gather everything into a new sub-directory; I'd recommend using the model's name or any other convention that helps you know which one's which.
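    As an illustration, here is a hedged sketch of such a call; the format names follow the list above and the sub-directory name is arbitrary:

    # Export a CSV summary, the production-ready MOJO files and the plots
    # into a custom sub-directory
    export_results(r, which = c("csv", "mojo", "plots"), subdir = "titanic_fare_model")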

    Import and use your models

    If you'd like to re-use your exported models to predict new datasets, you have several options:

    • h2o_predict_MOJO() [recommended]: This function lets the user predict using h2o's .zip file containing the MOJO files. These files are also the ones used when putting the model into production in any other environment. Also, MOJO lets you change h2o versions without issues (see the sketch after this list).
    • h2o_predict_binary(): This function lets the user predict using the h2o binary file. The h2o version/build must match for it to work.
    • h2o_predict_model(): This function lets the user run predictions from a H2O Model Object same as you'd use the predict base function. Will probably only work in your current session as you must have the actual trained object to use it.
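    As an illustration only: the argument names below are assumptions, not the documented API, so check ?lares::h2o_predict_MOJO for the actual signature. Scoring a new dataset with the exported MOJO might look roughly like this:

    # Hypothetical call; 'new_passengers' is a made-up dataframe with the same columns as df,
    # and the argument names are assumptions rather than the documented interface
    scores <- h2o_predict_MOJO(df = new_passengers, model_path = "titanic_fare_model")
    head(scores)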

    Complementary Posts

    Text processing and word stemming for classification models in master data management (MDM) context in R

    In business settings, narrowly specialized tasks often come up that require a special approach because they do not fit into the standard flow of data processing and model building. One of these tasks is the classification of new products in the master data management (MDM) process.

    Example 1

    You are working in a large company (a supplier) engaged in the production and/or sale of products, including through wholesale intermediaries (distributors). Your distributors often have an obligation (to the company you work for) to regularly report their own sales of your products – the so-called sell-out. Distributors are not always able to report the sold products using your company's product codes; more often they use their own codes and their own product names, which differ from the names in your system. Accordingly, your database needs to maintain a table matching the distributors' product names to the product codes of your own system. The more distributors, the more variations of the name of the same product. If you have a large assortment portfolio, this becomes a problem that is solved by manual, labor-intensive maintenance of such matching tables whenever new product name variations arrive in your accounting system.

    If we treat the names of such products as the texts of documents, and the codes of your accounting system (to which these variations are tied) as classes, then we obtain a multi-class text classification task. Such a matching table (maintained manually by operators) can be considered a training sample, and if a classification model were built on it, it would be able to reduce the operators' effort in classifying the stream of new names of existing products. However, the classic approach of working with the text “as is” will not save you, as explained below.

    Example 2

    Your company's database receives sales (or price) data for products from external analytical (marketing) agencies or from parsing third-party sites. Here, too, the same product from each data source will contain variations in spelling. Within this example, the task can be even harder than in Example 1, because business users often need to analyze not only your products but also the assortment of your direct competitors, so the number of classes (reference products) to which variations are tied increases sharply.

    What is the specificity of such a class of tasks?

    First, there are a lot of classes (in fact, as many classes as you have products). And if in this process you have to work not only with your company's products but also with competitors', new classes can appear every day – so it becomes pointless to train a model once and reuse it repeatedly to predict new products.

    Second, the number of documents (different variations of the same product) per class is not well balanced: some classes may contain a single document, others many more.

    Why does the classic approach of the multiple classification of texts work poorly?

    Consider the shortcomings of the classic text processing approach by steps:

    • Stop words.

    In such tasks there are no stop words in the generally accepted sense of any text processing package.

    • Tokenization

    In classic packages out of the box, the splitting of text into words is based on punctuation or spaces. In this class of tasks (where the length of the input text field is often limited), it is not uncommon to receive product names without spaces, where words are separated only visually by a change of letter case, digits or language. How would out-of-the-box tokenization in your favorite programming language cope with the wine name “Dom.CHRISTIANmoreau0,75LPeLtr.EtFilChablis”? (Unfortunately, it is not a joke.)

    • Stemming

    Product names are not texts in the classic sense of the task (such as news articles, service reviews or newspaper headlines) from which a suffix can be isolated and discarded. Product names often contain abbreviations and shortened words, and it is not clear how to isolate such a suffix from them. There are also brand names from other language groups (for example, French or Italian brands) that do not lend themselves to normal stemming.

    • Reducing document-terms matrices.

    Often, when building “Document-Term” matrices, your language's package offers to reduce the sparsity of the matrix by removing words (columns of the matrix) with a frequency below some minimum threshold. In classical tasks this really does help improve quality and reduce overhead when training the model. But not in tasks like these. As noted above, the class distribution is far from balanced – a class may easily contain a single product name (for example, a rare and expensive brand that was sold for the first time and so far appears only once in the training sample). With the classic sparsity-reduction approach we degrade the quality of the classifier.

    • Training the model.

    Usually, some kind of model is trained on the texts (LibSVM, a naive Bayes classifier, neural networks, or something else) and is then used repeatedly. In this case, new classes can appear daily and a class may contain only a single document. Therefore, it makes no sense to spend a long time training one large model – any algorithm with online training is enough, for example a kNN classifier with one nearest neighbor.

    Next, we will compare classification using the traditional approach with classification based on the proposed package. We will use tidytext as an auxiliary package.

    Case example

    #devtools::install_github(repo = 'https://github.com/edvardoss/abbrevTexts')
    library(abbrevTexts)
    library(tidytext) # text proccessing
    library(dplyr) # data processing
    library(stringr) # data processing
    library(SnowballC) # traditional stemming approach
    library(tm) #need only for tidytext internal purpose
    

    The package includes 2 data sets with wine names: the original names of wines from external data sources – “rawProducts” – and the unified names of wines written according to the company's master data standards – “standardProducts”. The rawProducts table has many spelling variations of the same product; these variations are reduced to one product in standardProducts through a many-to-one relationship on the “StandartId” key column. P.S. The variations in the “rawProducts” table are generated programmatically, but with the maximum possible similarity to how product names arrive from various external sources in my experience (although in places I may have overdone it).

    data(rawProducts, package = 'abbrevTexts')
    head(rawProducts)
    ## # A tibble: 6 x 2
    ##   StandartId rawName                                               
    ##        <dbl> <chr>                                                 
    ## 1          1 MELOT   mouvedr.Les olivier.vinPAY.                   
    ## 2          1 melotMouvedr.LesOlivier.VinPay.                       
    ## 3          2 Carigna. grenac.les betes   Rousses igp pay.   HERAULT
    ## 4          2 carign. GrenacheLesBetes / RoussesIGP Pays -  HERAU.  
    ## 5          2 Carignan  grenac. Bete. les rousse.igp PAY.   HERAULT 
    ## 6          3 Petit / chat   Rouge -Vin  pay.
    
    data(standardProducts, package = 'abbrevTexts')
    head(standardProducts)
    ## # A tibble: 6 x 3
    ##   StandartId StandardName                                                        WineColor
    ##        <dbl> <chr>                                                               <chr>    
    ## 1          1 Melot/Mouvedre, Les Oliviers, Vin de Pays d'Oc - France / Red Wines Red Wines
    ## 2          2 Carignan/Grenache, Les Betes Rousses, IGP Pays d'Herault - France ~ Red Wines
    ## 3          3 Le Petit Chat Rouge, Vin de Pays d'Oc - France / Red Wines          Red Wines
    ## 4          4 Pinot Noir, Baron de Poce, Pierre Chainier, Loire Valley - France ~ Red Wines
    ## 5          5 Gamay, Uva Non Grata, Vin de France - France / Red Wines            Red Wines
    ## 6          6 58 Guineas Claret, Bordeaux - France / Red Wines                    Red Wines
    

    Train and test split

    set.seed(1234)
    trainSample <- sample(x = seq(nrow(rawProducts)),size = .9*nrow(rawProducts))
    testSample <- setdiff(seq(nrow(rawProducts)),trainSample)
    testSample
    ##  [1]   1   5   7   8   9  11  32  37  44  45  46  48  68  69  82 110 113 119 128 179 187
    ## [22] 190 191 194 202 213 223 241 256 260 268 271 272 283 288 292 309 344 351 376 395 407
    

    Create dataframes for 'no stemming mode' and 'traditional stemming mode'

    df <- rawProducts %>% mutate(prodId=row_number(), 
                                 rawName=str_replace_all(rawName,pattern = '\\.','. ')) %>% 
      unnest_tokens(output = word,input = rawName) %>% count(StandartId,prodId,word)
    
    df.noStem <- df %>% bind_tf_idf(term = word,document = prodId,n = n)
    
    df.SnowballStem <- df %>% mutate(wordStm=SnowballC::wordStem(word)) %>% 
      bind_tf_idf(term = wordStm,document = prodId,n = n)
    

    Create document terms matrix

    dtm.noStem <- df.noStem %>% 
      cast_dtm(document = prodId,term = word,value = tf_idf) %>% data.matrix()
    
    dtm.SnowballStem <- df.SnowballStem %>% 
      cast_dtm(document = prodId,term = wordStm,value = tf_idf) %>% data.matrix()
    

    Create knn model for 'no stemming mode' and calculate accuracy

    knn.noStem <- class::knn1(train = dtm.noStem[trainSample,],
                              test = dtm.noStem[testSample,],
                              cl = rawProducts$StandartId[trainSample])
    mean(knn.noStem==rawProducts$StandartId[testSample])
    ## [1] 0.4761905
    

    accuracy knn.noStem: 0.4761905 (47%)

    Create knn model for 'stemming mode' and calculate accuracy

    knn.SnowballStem <- class::knn1(train = dtm.SnowballStem[trainSample,],
                                   test = dtm.SnowballStem[testSample,],
                                   cl = rawProducts$StandartId[trainSample])
    mean(knn.SnowballStem==rawProducts$StandartId[testSample])
    ## [1] 0.5
    

    accuracy knn.SnowballStem: 0.5 (50%)

    abbrevTexts primer

    Below is an example on the same data but using the functions from abbrevTexts package

    Separating words by case

    df <- rawProducts %>% mutate(prodId=row_number(), 
                                 rawNameSplitted= makeSeparatedWords(rawName)) %>% 
            unnest_tokens(output = word,input = rawNameSplitted)
    print(df)
    ## # A tibble: 2,376 x 4
    ##    StandartId rawName                             prodId word   
    ##         <dbl> <chr>                                <int> <chr>  
    ##  1          1 MELOT   mouvedr.Les olivier.vinPAY.      1 melot  
    ##  2          1 MELOT   mouvedr.Les olivier.vinPAY.      1 mouvedr
    ##  3          1 MELOT   mouvedr.Les olivier.vinPAY.      1 les    
    ##  4          1 MELOT   mouvedr.Les olivier.vinPAY.      1 olivier
    ##  5          1 MELOT   mouvedr.Les olivier.vinPAY.      1 vin    
    ##  6          1 MELOT   mouvedr.Les olivier.vinPAY.      1 pay    
    ##  7          1 melotMouvedr.LesOlivier.VinPay.          2 melot  
    ##  8          1 melotMouvedr.LesOlivier.VinPay.          2 mouvedr
    ##  9          1 melotMouvedr.LesOlivier.VinPay.          2 les    
    ## 10          1 melotMouvedr.LesOlivier.VinPay.          2 olivier
    ## # ... with 2,366 more rows
    

    As you can see, the tokenization of the text was carried out correctly: it takes into account not only transitions between upper and lower case in words written together, but also punctuation marks between words written together without spaces.

    Creating a stemming dictionary based on a training sample of words

    After a long search among different stemming implementations, I came to the conclusion that traditional methods based on language rules are not suitable for such specific tasks, so I had to look for my own approach. I arrived at a solution based on unsupervised learning that is not sensitive to the language of the text or to the degree of abbreviation of the words in the training sample.

    The function takes a vector of words as input, together with the minimum word length for the training sample and the minimum share for considering a child word an abbreviation of a parent word, and then does the following (a small illustrative sketch of steps 4.1 and 4.2 is given right after the list):

    1. Discarding words with a length less than the set threshold
    2. Discarding words consisting of numbers
    3. Sort the words in descending order of their length
    4. For each word in the list:
    • 4.1 Keep only the words whose length is less than the length of the current word and greater than or equal to the length of the current word multiplied by the minimum share
    • 4.2 From the filtered words, select those that are the beginning (a prefix) of the current word
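    To make steps 4.1 and 4.2 concrete, here is a minimal illustrative sketch in plain R. It is not the makeAbbrStemDict implementation (which also builds abbreviation chains and runs in parallel), just the prefix-based pair search under the same two length conditions:

    find_abbr_pairs <- function(words, min.len = 3, min.share = 0.6) {
      words <- unique(words)
      words <- words[nchar(words) >= min.len]                  # step 1: drop short words
      words <- words[!grepl("^[0-9]+$", words)]                 # step 2: drop pure numbers
      words <- words[order(nchar(words), decreasing = TRUE)]    # step 3: longest first
      pairs <- list()
      for (w in words) {                                        # step 4
        cand <- words[nchar(words) < nchar(w) &
                      nchar(words) >= min.share * nchar(w)]     # 4.1: length filter
        cand <- cand[startsWith(w, cand)]                       # 4.2: child is a prefix of parent
        if (length(cand) > 0) pairs[[w]] <- data.frame(parent = w, child = cand)
      }
      do.call(rbind, pairs)
    }
    
    # returns the pairs bodegas -> bodeg and bodeg -> bode; "2015" is discarded
    find_abbr_pairs(c("bodegas", "bodeg", "bode", "2015"))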

    Let's say that we fix min.share = 0.7. At this intermediate stage (4.2), we get a parent-child table where examples like these can be found:

    ##    parent child
    ## 1 bodegas bodeg
    ## 2   bodeg  bode
    

    Note that each line meets the condition that the length of the child word is not shorter than 70% of the length of the parent word.

    However, some pairs cannot be considered word abbreviations because in them different parents are reduced to the same child, for example:

    ##    parent child
    ## 1 bodegas bodeg
    ## 2 bodegue bodeg
    

    For such cases, my function keeps only one of the pairs.

    Let's go back to the example with unambiguous abbreviations of words

    ##    parent child
    ## 1 bodegas bodeg
    ## 2   bodeg  bode
    

    But if we look a little more closely, we see that the word 'bodeg' is common to these 2 pairs, and it allows us to connect the pairs into one chain of abbreviations without violating our initial condition on the length a word must have to be considered an abbreviation of another word:

    bodegas->bodeg->bode

    So we come to a table of the form:

    ##    parent child terminal.child
    ## 1 bodegas bodeg           bode
    ## 2   bodeg  bode           bode
    

    Such chains can be of arbitrary length, and the found pairs can be assembled into such chains recursively. Thus we come to the final steps of the algorithm:

    1. Recursively iterating through the found pairs to determine the final (terminal) child for all members of chains
    2. Return the abbreviation dictionary

    The makeAbbrStemDict function is automatically parallelized across several threads, loading all processor cores, so keep this in mind for large volumes of text.

    abrDict <- makeAbbrStemDict(term.vec = df$word,min.len = 3,min.share = .6)
    abrDict <- as_tibble(abrDict)
    print(abrDict) # We can see parent word, intermediate results and total result (terminal child)
    ## # A tibble: 408 x 3
    ##    parent    child    terminal.child
    ##    <chr>     <chr>    <chr>         
    ##  1 abruzz    abruz    abruz         
    ##  2 absoluto  absolut  absolu        
    ##  3 absolut   absolu   absolu        
    ##  4 aconcagua aconcagu aconcag       
    ##  5 aconcagu  aconcag  aconcag       
    ##  6 agricola  agricol  agrico        
    ##  7 agricol   agrico   agrico        
    ##  8 alabastro alabast  alabast       
    ##  9 albarino  albarin  albari        
    ## 10 albarin   albari   albari        
    ## # ... with 398 more rows
    

    The output of the stemming dictionary as a table is also convenient because it allows you to selectively, and in a simple way within the “dplyr” paradigm, delete some of the stemming rows.

    Let's say that we want to exclude the parent word “abruzz” and the terminal child group “absolu” from the stemming dictionary:

    abrDict.reduced <- abrDict %>% filter(parent!='abruzz',terminal.child!='absolu')
    print(abrDict.reduced)
    ## # A tibble: 405 x 3
    ##    parent     child     terminal.child
    ##    <chr>      <chr>     <chr>         
    ##  1 aconcagua  aconcagu  aconcag       
    ##  2 aconcagu   aconcag   aconcag       
    ##  3 agricola   agricol   agrico        
    ##  4 agricol    agrico    agrico        
    ##  5 alabastro  alabast   alabast       
    ##  6 albarino   albarin   albari        
    ##  7 albarin    albari    albari        
    ##  8 alentejano alentejan alentejan     
    ##  9 alianca    alianc    alian         
    ## 10 alianc     alian     alian         
    ## # ... with 395 more rows
    

    Compare the simplicity and clarity of this solution with what is offered on Stack Overflow:

    Text-mining with the tm-package – word stemming

    Stemming using the abbreviation dictionary

    # Attach the terminal child for every word that has one and fall back to
    # the original word when no abbreviation was found
    df.AbbrStem <- df %>% left_join(abrDict %>% select(parent,terminal.child),by = c('word'='parent')) %>% 
        mutate(wordAbbrStem=coalesce(terminal.child,word)) %>% select(-terminal.child)
    print(df.AbbrStem)
    ## # A tibble: 2,376 x 5
    ##    StandartId rawName                             prodId word    wordAbbrStem
    ##         <dbl> <chr>                                <int> <chr>   <chr>       
    ##  1          1 MELOT   mouvedr.Les olivier.vinPAY.      1 melot   melo        
    ##  2          1 MELOT   mouvedr.Les olivier.vinPAY.      1 mouvedr mouvedr     
    ##  3          1 MELOT   mouvedr.Les olivier.vinPAY.      1 les     les         
    ##  4          1 MELOT   mouvedr.Les olivier.vinPAY.      1 olivier olivie      
    ##  5          1 MELOT   mouvedr.Les olivier.vinPAY.      1 vin     vin         
    ##  6          1 MELOT   mouvedr.Les olivier.vinPAY.      1 pay     pay         
    ##  7          1 melotMouvedr.LesOlivier.VinPay.          2 melot   melo        
    ##  8          1 melotMouvedr.LesOlivier.VinPay.          2 mouvedr mouvedr     
    ##  9          1 melotMouvedr.LesOlivier.VinPay.          2 les     les         
    ## 10          1 melotMouvedr.LesOlivier.VinPay.          2 olivier olivie      
    ## # ... with 2,366 more rows
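    A quick way to see the effect of the substitution is to compare the number of distinct tokens before and after the abbreviation stemming (a small sanity check, assuming dplyr is attached):

    n_distinct(df.AbbrStem$word)          # distinct raw tokens
    n_distinct(df.AbbrStem$wordAbbrStem)  # distinct tokens after abbreviation stemming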
    


    TF-IDF for stemmed words

    df.AbbrStem <- df.AbbrStem %>% count(StandartId,prodId,wordAbbrStem) %>% 
      bind_tf_idf(term = wordAbbrStem,document = prodId,n = n)
    print(df.AbbrStem)
    ## # A tibble: 2,289 x 7
    ##    StandartId prodId wordAbbrStem     n    tf   idf tf_idf
    ##         <dbl>  <int> <chr>        <int> <dbl> <dbl>  <dbl>
    ##  1          1      1 les              1 0.167  2.78  0.463
    ##  2          1      1 melo             1 0.167  4.94  0.823
    ##  3          1      1 mouvedr          1 0.167  5.34  0.891
    ##  4          1      1 olivie           1 0.167  4.25  0.708
    ##  5          1      1 pay              1 0.167  3.15  0.525
    ##  6          1      1 vin              1 0.167  2.78  0.463
    ##  7          1      2 les              1 0.167  2.78  0.463
    ##  8          1      2 melo             1 0.167  4.94  0.823
    ##  9          1      2 mouvedr          1 0.167  5.34  0.891
    ## 10          1      2 olivie           1 0.167  4.25  0.708
    ## # ... with 2,279 more rows
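    For reference, tidytext's bind_tf_idf() computes tf as the term count divided by the total number of terms in the document (for the first row above, 1/6 ≈ 0.167) and idf as the natural logarithm of the number of documents divided by the number of documents containing the term. If desired, the scores can be recomputed by hand as a sanity check (a sketch, assuming dplyr is attached):

    n_docs <- n_distinct(df.AbbrStem$prodId)

    df.check <- df.AbbrStem %>%
      group_by(prodId) %>%
      mutate(tf_manual = n / sum(n)) %>%                        # share of the term within its document
      group_by(wordAbbrStem) %>%
      mutate(idf_manual = log(n_docs / n_distinct(prodId))) %>% # natural log, as in tidytext
      ungroup()

    all.equal(df.check$tf,  df.check$tf_manual)
    all.equal(df.check$idf, df.check$idf_manual)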
    

    Create the document-term matrix

    dtm.AbbrStem <- df.AbbrStem %>% 
      cast_dtm(document = prodId,term = wordAbbrStem,value = tf_idf) %>% data.matrix()
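    Before fitting the classifier it is worth making sure that the rows of the document-term matrix line up with rawProducts, because trainSample and testSample index both objects by position. A minimal defensive check could look like this (it assumes rawProducts has one row per product with a prodId column in the same order — adjust the column name to your data):

    # Rows of the DTM are named by prodId (cast_dtm keeps document names);
    # they must be in the same order as rawProducts before positional indexing.
    stopifnot(nrow(dtm.AbbrStem) == nrow(rawProducts))
    stopifnot(all(rownames(dtm.AbbrStem) == as.character(rawProducts$prodId)))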
    

    Create the knn model for the 'abbrevTexts' mode and calculate its accuracy

    knn.AbbrStem <- class::knn1(train = dtm.AbbrStem[trainSample,],
                                    test = dtm.AbbrStem[testSample,],
                                    cl = rawProducts$StandartId[trainSample])
    mean(knn.AbbrStem==rawProducts$StandartId[testSample]) 
    ## [1] 0.8333333
    

    Accuracy of knn.AbbrStem: 0.8333333 (about 83%)
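    To see where the remaining errors come from, the misclassified test products can be pulled out and inspected directly (a short follow-up sketch using only the objects defined above):

    # Which test products did the abbreviation-stemmed model get wrong?
    wrong <- which(knn.AbbrStem != rawProducts$StandartId[testSample])
    rawProducts[testSample[wrong], ]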

    As you can see, we have obtained a significant improvement in classification quality on the test sample. Tidytext is a convenient package for a small corpus of texts, but for a large corpus the abbrevTexts package is also perfectly suitable for preprocessing and normalization, and in such specific tasks it usually gives better accuracy than the traditional approach.

    Link to github: https://github.com/edvardoss/abbrevTexts
